Don’t Patronize Me! An Annotated Dataset with Patronizing and Condescending Language towards Vulnerable Communities

In this paper, we introduce a new annotated dataset which is aimed at supporting the development of NLP models to identify and categorize language that is patronizing or condescending towards vulnerable communities (e.g. refugees, homeless people, poor families). While the prevalence of such language in the general media has long been shown to have harmful effects, it differs from other types of harmful language, in that it is generally used unconsciously and with good intentions. We furthermore believe that the often subtle nature of patronizing and condescending language (PCL) presents an interesting technical challenge for the NLP community. Our analysis of the proposed dataset shows that identifying PCL is hard for standard NLP models, with language models such as BERT achieving the best results.


Introduction
In this paper, we analyze the use of Patronizing and Condescending Language (PCL) towards vulnerable communities in the media. An entity engages in PCL when its language use shows a superior attitude towards others or depicts them in a compassionate way. This effect is not always conscious and the intention of the author is often to help the person or group they refer to (e.g. by raising awareness or funds, or moving the audience to action). However, these superior attitudes and a discourse of pity can routinize discrimination and make it less visible (Ng, 2007). Moreover, general media publications reach a large audience and we believe that unfair treatment of vulnerable groups in such media might lead to greater exclusion and inequalities.
While there has been substantial work on modelling language that purposefully undermines others, e.g. offensive language or hate speech (Zampieri et al., 2019; Basile et al., 2019), the modelling of PCL is still an emergent area of study in NLP. One likely reason is that the use of PCL in the media is commonly unconscious, and that it is subtler and more subjective than the types of discourse that are typically targeted in NLP. In particular, to the best of our knowledge, a specific focus on PCL towards vulnerable communities has not yet been considered.
Within a broader setting, there has been some work on PCL which is concerned with the communication between two parties, where one is patronized by the other, such as in social media interactions. In particular, Wang and Potts (2019) recently published the Talkdown corpus for condescension detection in comment-reply pairs from Reddit. In this work, the authors highlight the difficulty of the task and the need for a high-quality dataset annotated by experts, which is the approach we take for studying PCL towards vulnerable communities.
To encourage more research on detecting PCL, we introduce the Don't Patronize Me! dataset (available at https://github.com/Perez-AlmendrosC/dontpatronizeme). This dataset contains more than 10,000 paragraphs extracted from news stories, which have been annotated to indicate the presence of PCL at the text span level. The paragraphs were selected to cover English language news sources from 20 different countries, covering different types of vulnerable communities (e.g. homeless people, immigrants and poor families). We furthermore propose a taxonomy of PCL categories, focused on PCL towards vulnerable communities. Each of the PCL text spans from our dataset has been annotated with a category label from this taxonomy. Finally, we also provide some analysis of the dataset. Among other findings, we show that even simple baselines are able to detect PCL to some extent, which suggests that this task is feasible for NLP systems, despite the subtle nature of PCL. On the other hand, we also find that the considered models, including approaches based on BERT (Devlin et al., 2019), struggle to detect certain categories of PCL, suggesting that there is still considerable room for improvement. In particular, while some forms of PCL can be detected by identifying relatively simple linguistic patterns, many other cases seem to require a non-trivial amount of world knowledge.
This work is licensed under a Creative Commons Attribution 4.0 International Licence. Licence details: http://creativecommons.org/licenses/by/4.0/.

Related Work
Condescending and patronizing treatment has been widely studied in various fields, such as language studies (Margić, 2017), sociolinguistics (Giles et al., 1993), politics (Huckin, 2002) or medicine (Komrad, 1983). Within NLP, there has been extensive work on several forms of harmful language, but this work has generally focused on explicit, aggressive and flagrant phenomena such as fake news detection (Conroy et al., 2015); trustworthiness prediction and fact-checking (Atanasova et al., 2018; Atanasova et al., 2019); modeling offensive language, both generic (Zampieri et al., 2019) and geared towards specific communities (Basile et al., 2019); or rumour propagation (Derczynski et al., 2017). Recently, however, some work on condescending language has started to appear. For instance, Wang and Potts (2019) introduced the task of modelling condescension in direct communication from an NLP perspective, and developed a dataset with annotated social media messages. In the same year, Sap et al. (2019) discussed the social and power implications behind certain uses of language, an important concept in the unbalanced power relations that are often present in condescending treatment. Also related to unfair treatment of underprivileged groups, Mendelsohn et al. (2020) analyzed, from a computational linguistics point of view, how language has dehumanized minorities in news media over time.

Background on PCL
Research in sociolinguistics has suggested the following traits of PCL towards vulnerable communities:
• it fuels discriminatory behaviour by relying on subtle language (Mendelsohn et al., 2020);
• it creates and feeds stereotypes (Fiske, 1993), which lead to greater exclusion, discrimination, rumour spreading and misinformation (Nolan and Mikami, 2013);
• it strengthens power-knowledge relationships (Foucault, 1980), positioning one community as superior to others;
• it usually calls for charitable action instead of cooperation, so communities in need are presented as passive receivers of help, unable to solve their own problems and waiting for a saviour to help them out of their situation (Bell, 2013; Straubhaar, 2015);
• it tends to avoid stating the reasons for very deep-rooted societal problems, by concealing those responsible or even, in some cases, by apportioning blame to the underprivileged communities or individuals themselves;
• it proposes ephemeral and simple solutions (Chouliaraki, 2010), which oversimplify the wicked problems (Head and others, 2008) vulnerable communities face.
The use of PCL makes it more difficult for vulnerable communities to overcome difficulties and reach total inclusion (Nolan and Mikami, 2013).

How to identify PCL?
In this work, we analyze discourse on vulnerable communities. We will consider a piece of text as containing PCL when, referring to an underprivileged individual or community, we can identify one or several of the following traits:
• The use of language states the differences between 'us' and 'them'. The vulnerable community is depicted as different from us, with other experiences and life stories. This discourse establishes an invisible distance between the two communities.
• The language raises a feeling of pity towards the vulnerable community, for example by using (or abusing) adjectives or by resorting to flowery words to depict a certain situation in a literary way (e.g. metaphors, euphemisms or hyperboles).
• The author and the community they belong to are presented as saviours of those in need. Not only do they have the capacity to solve their problems, but also a moral responsibility to do so. The superior or privileged community is also presented as having the knowledge and experience to face and solve the problems of the vulnerable ones.
• In the opposite direction, the members of the vulnerable community are described as lacking the privileges the author's community enjoys, or even the knowledge or experience to overcome their own problems. They will need, therefore, the help of others to improve their situation.
• The vulnerable community and its members are presented either as victims (i.e. overwhelmed, victimized or pitied) or as heroes just because of the situation they face.

What is not PCL?
Precisely because we are studying the discourse towards vulnerable communities, it can be easy to mistakenly classify a piece of text as condescending. We want to highlight, in particular, the following two situations where the language that is used to talk about underprivileged groups is not condescending.
• Because they are experiencing vulnerability, the news about them often depicts rough situations. The description of an extreme situation can be harsh and stark and leave the reader with a feeling of sadness and helplessness, while not necessarily being condescending.
• With PCL, the superiority of the author is concealed behind a friendly or compassionate approach towards the situation of vulnerable communities. Thus, a message which is openly offensive, aggressive or containing prejudiced, discriminatory or hate speech is not considered to be PCL for the purpose of our dataset.

The Don't Patronize Me! dataset
The Don't Patronize Me! dataset currently contains 10,637 paragraphs about potentially vulnerable social groups. These paragraphs have been selected from general news stories and have been annotated with labels that indicate the type of PCL that is present, if any. The paragraphs have been extracted from the News on Web (NoW) corpus (Davies, 2013). To this end, we first selected ten keywords related to potentially vulnerable communities that are widely covered in the media and susceptible to condescending or patronizing treatment: disabled, homeless, hopeless, immigrant, in need, migrant, poor families, refugee, vulnerable and women. Next, we retrieved paragraphs in which these keywords are mentioned, choosing a similar number of paragraphs for each of the 10 keywords and each of the 20 English-speaking countries that are covered in the NoW corpus. An overview of the number of paragraphs for each keyword-country combination can be found in Table 1. All the selected paragraphs come from news stories that were published between 2010 and 2018. The data was annotated by three expert annotators, with backgrounds in communication, media and data science. Two annotators annotated the whole dataset (ann1 and ann2), while the third one (ann3) acted as a referee to provide a final label in case of disagreements. An extended data statement (Bender and Friedman, 2018) about the corpus will be published together with the dataset.
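Table 1: Number of paragraphs for each keyword-country combination.
Country | disabled | homeless | hopeless | immigrant | in need | migrant | poor families | refugee | vulnerable | women | Total
Australia | 56 | 51 | 52 | 56 | 57 | 57 | 54 | 54 | 60 | 55 | 552
Bangladesh | 51 | 57 | 46 | 50 | 51 | 56 | 46 | 52 | 55 | 53 | 517
Canada | 53 | 53 | 52 | 51 | 52 | 47 | 55 | 56 | 61 | 52 | 532
Ghana | 62 | 55 | 57 | 56 | 51 | 58 | 25 | 53 | 54 | 55 | 526
Hong Kong | 60 | 58 | 32 | 53 | 55 | 59 | 22 | 49 | 52 | 61 | 501
Ireland | 61 | 49 | 55 | 58 | 58 | 58 | 36 | 58 | 48 | 55 | 536
India | 53 | 52 | 62 | 60 | 57 | 52 | 52 | 58 | 59 | 50 | 555
Jamaica | 53 | 62 | 47 | 56 | 58 | 51 | 11 | 54 | 50 | 51 | 493
Kenya | 52 | 51 | 55 | 56 | 51 | 54 | 55 | 49 | 57 | 61 | 541
Sri Lanka | 53 | 57 | 57 | 59 | 48 | 53 | 32 | 56 | 49 | 50 | 514
Malaysia | 58 | 48 | 47 | 54 | 62 | 58 | 53 | 58 | 60 | 56 | 554
Nigeria | 55 | 60 | 49 | 52 | 53 | 56 | 49 | 56 | 60 | 55 |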

Categories of PCL towards vulnerable communities
For all text spans that were annotated as containing PCL, the annotators also provided a category label. This allows us to analyze at a finer-grained level to what extent NLP models are able to recognize the different traits of PCL. These labels might also make it easier to train NLP models for detecting PCL, for instance by treating them as privileged information during training (Vapnik and Vashist, 2009). Inspired by the characteristics of PCL discussed in Section 3, we have used the following seven categories, which we grouped into three higher-level categories.
• The saviour. The community which the author and the majority of the audience belong to is presented in some way as saviours of those vulnerable or in need. The language used subtly positions the author in a better, more privileged situation than the vulnerable community. They express the will to help them, from their superior and advantageous position. There is a clear difference between the we and the they. As part of this category, we can find examples of the following subcategories:
- Unbalanced power relations. By means of the language, the author distances themselves from the community or the situation they are talking about, and expresses the will, capacity or responsibility to help them. It is also present when the author entitles themselves to give something positive to others in a more vulnerable situation, especially when what the author concedes is a right which they do not have any authority to decide to give (e.g. 'You can make a difference in their lives' or 'They come back in with nothing and we need to outfit them again' or 'They deserve another opportunity' or 'They also have the right to love').
- Shallow solution. A simple and superficial charitable action by the privileged community is presented either as life-saving/life-changing for the unprivileged one, or as a solution for a deep-rooted problem (e.g. 'Raise money to combat homelessness by curling up in sleeping bags for one night' or 'If every supporter on Facebook donated just one box each it would make a real difference to many poor families').
• The expert. The underlying message is that the privileged community, which the author and their audience belong to, knows better what the vulnerable community needs, how they are or what they should do to overcome their situation. We consider the following subcategories:
- Presupposition, when the author assumes a situation as certain without having all the information, or generalises their or somebody else's experience as a categorical truth without presenting a valid, trustworthy source for it (e.g. a research work or survey). The use of stereotypes or clichés is also considered to be an example of presupposition (e.g. '[...] elderly or disabled people who are simply unable to evacuate due to physical limitations' or 'If the economy fills with women, it will develop beautifully').
- Authority voice, when the author presents themselves as a spokesperson of the group, or explains or advises the members of a community about the community itself or a specific situation they are living through.
(e.g. 'Accepting their situation is the first step to having a normal life' or 'We also know that they can benefit by receiving counseling from someone who can help them understand.').
• The poet. The focus is not on the we (author and audience), but on the they (the individual or community referred to). The author uses a literary style to describe people or situations. They might, for example, use (or abuse) adjectives or rhetorical devices to either present a difficult situation as somehow beautiful, something to admire and learn from, or they might carefully detail its roughness to touch the heart of their audience. The subcategories we establish are:
- Metaphor. Metaphors can conceal PCL, as they cast an idea in another light, making a comparison between unrelated concepts, often with the objective of depicting a certain situation in a softer way. For the annotation of this dataset, euphemisms are considered as examples of metaphors (e.g. 'Poor children might find more obstacles in their race to a worthy future' or 'those who cling to boats to reach a shore of survival').
- Compassion. The author presents the vulnerable individual or community as needy, raising a feeling of pity and compassion from the audience towards them. It is commonly characterized by the use of flowery wording that does not provide information, but in which the author enjoys the detailed and poetic description of the vulnerability (e.g. 'Some are lured by corrupt "agents", smuggled across the searing Sahara and discarded in the streets of Europe, resigned to selling fake designer bags as undocumented immigrants' or 'For the roughly 2,000 migrants who call it home, the broken windows and decaying walls of the decrepit warehouse offer scant respite from the harsh blizzard conditions currently striking Serbia').
- The poorer, the merrier. The text is focused on the community, especially on how the vulnerability makes them better (e.g. stronger, happier or more resilient) or how they share a positive attribute just for being part of a vulnerable community. People living in vulnerable situations are presented as having values to admire and learn from. The message expresses the idea of vulnerability as something beautiful or poetic. We can think of the typical example of 'poor people are happier because they don't have material goods' (e.g. 'He is reminded of the true meaning of hope by people living in situations the world would see as hopeless' or 'her mom is disabled and living with her gives her strength to face everyday's life' or 'refugees are wonderful people').
Finally, in the dataset, we also included an "Other" category, to classify all the text spans which the annotators considered to contain PCL, but which they could not assign to any of the previous categories. However, the annotators did not need to use this label for any instance.
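The two-level taxonomy described above can be written down as a small lookup structure, which may be convenient when aggregating span labels by higher-level category. The following Python sketch is purely illustrative: the names `PCL_TAXONOMY`, `ALL_SUBCATEGORIES` and `parent_category` are hypothetical and are not taken from the released dataset.

```python
# Two-level PCL taxonomy as described in the text: three higher-level
# categories grouping seven subcategories.
# All identifier names here are illustrative assumptions.
PCL_TAXONOMY = {
    "The saviour": ["Unbalanced power relations", "Shallow solution"],
    "The expert": ["Presupposition", "Authority voice"],
    "The poet": ["Metaphor", "Compassion", "The poorer, the merrier"],
}

# Flat list of the seven subcategories, in taxonomy order.
ALL_SUBCATEGORIES = [
    sub for children in PCL_TAXONOMY.values() for sub in children
]


def parent_category(subcategory):
    """Return the higher-level category that a subcategory belongs to."""
    for parent, children in PCL_TAXONOMY.items():
        if subcategory in children:
            return parent
    raise KeyError(f"Unknown PCL subcategory: {subcategory!r}")
```

For instance, `parent_category("Compassion")` returns `"The poet"`.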

Annotation
To annotate the dataset, a two-step process has been followed. In the first step, annotators determined which paragraphs contain PCL. Subsequently, in the second step, the annotators indicated which text spans within these paragraphs contain PCL and they labelled each of these text spans with a particular PCL category. We now discuss these two steps in more detail.

Step 1: Paragraph-Level Identification of PCL
The aim of this annotation step is to decide for each paragraph whether or not it contains PCL. This annotation step proved more difficult than expected, owing to the often subtle and subjective nature of PCL. To mitigate this, we decided to annotate the paragraphs with three possible labels: 0, meaning that the paragraph does not contain PCL, 1, meaning that it is considered to be a borderline case, or 2, meaning that it clearly contains PCL. We computed the Kappa Inter-Annotator Agreement (IAA) between the two main annotators (ann1 and ann2) across the three labels, obtaining a moderate agreement of 41%. If we omit all paragraphs which were marked as borderline by at least one annotator, the IAA reaches a substantial 61% (Landis and Koch, 1977). Overall, ann1 and ann2 agreed on 9,182 paragraphs and disagreed on 1,457. Among the disagreements, 590 were total disagreements (0 vs 2) and 867 involved borderline cases. To maximize the amount of information captured by the annotations, and in particular to obtain a finer-grained assessment of borderline cases, we combined the labels provided by the two annotators into a 5-point scale, as follows:
• Label 0: both annotators assigned the label 0 (0 + 0).
• Label 1: one annotator assigned the label 0 and the other the label 1 (0 + 1).
• Label 2: both annotators assigned the label 1 (1 + 1).
• Label 3: one annotator assigned the label 1 and the other the label 2 (1 + 2).
• Label 4: both annotators assigned the label 2 (2 + 2).
Note how partial disagreement between the annotators is thus reflected in the final label. The cases of total disagreement, where one annotator labeled the instance as clearly not containing PCL and the other annotated it as clearly containing PCL (0 + 2), were annotated by ann3. After this supplementary annotation, the paragraph is either labelled as 1, if the third annotator considered the paragraph not to contain PCL, as 2, if they considered it to be a borderline case, or as 3, if they considered the paragraph to clearly contain PCL. In this way, the labels 0 and 4 remain reserved for clear-cut cases. For the experimental analysis presented in this paper, we treated paragraphs with final labels 0 and 1 as negative examples (i.e. as instances not containing PCL) and paragraphs with final labels 2, 3 and 4 as positive examples (i.e. as instances containing PCL). In total, interpreted in this way, the dataset contains 995 positive examples of PCL.
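The agreement measure and the label-merging procedure described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' code: the text only says "Kappa", so Cohen's kappa for two annotators is assumed here, and the merged label is assumed to be the sum of the two annotators' labels, except for total disagreements (0 vs 2), which are resolved by the referee as described. The function names are hypothetical.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators (assumed variant of 'Kappa')."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's label distribution.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[l] * count_b[l] for l in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators always used the same single label
    return (observed - expected) / (1.0 - expected)


def final_label(ann1, ann2, ann3=None):
    """Merge two labels in {0, 1, 2} into the 5-point scale (0..4).

    Assumption: the merged label is ann1 + ann2. For a total
    disagreement (0 vs 2), the referee label ann3 in {0, 1, 2} maps
    to the final labels 1, 2 or 3, as described in the text.
    """
    for value in (ann1, ann2):
        if value not in (0, 1, 2):
            raise ValueError("annotator labels must be 0, 1 or 2")
    if {ann1, ann2} == {0, 2}:
        if ann3 not in (0, 1, 2):
            raise ValueError("total disagreement needs a referee label")
        return ann3 + 1
    return ann1 + ann2


def is_positive(label):
    """Binary view used in the experiments: labels 2, 3 and 4 are PCL."""
    return label >= 2
```

With this scheme, a paragraph labelled 0 by ann1 and 2 by ann2 only becomes a positive example if the referee considers it at least a borderline case.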

Step 2: Identifying Span-Level PCL Categories
Those paragraphs labelled as containing PCL in Step 1 are collected for further annotation. The aim of this second step is to specify which text spans within these paragraphs contain PCL and to identify which PCL categories these text spans belong to. For this step, we used the BRAT rapid annotation tool (Stenetorp et al., 2012). Note that each paragraph might contain one or more text spans with PCL, which may be assigned to the same or to different categories. Table 2 shows how many spans have been labelled with each of the categories.
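Because a paragraph may contain several spans from the same or from different categories, one simple way to summarize its annotations at the paragraph level is a multi-hot vector over the seven categories. The sketch below is an illustrative assumption: the `(start, end, category)` tuple layout and the function name are hypothetical, and this is not the BRAT file format nor the format of the released dataset.

```python
# The seven PCL categories, in the order used in the taxonomy.
CATEGORIES = [
    "Unbalanced power relations",
    "Shallow solution",
    "Presupposition",
    "Authority voice",
    "Metaphor",
    "Compassion",
    "The poorer, the merrier",
]


def spans_to_multi_hot(spans):
    """Collapse span annotations of one paragraph into a multi-hot vector.

    `spans` is a list of (start, end, category) tuples; this layout is
    a hypothetical one chosen for illustration. Span boundaries are
    ignored, and repeated categories are counted once.
    """
    present = {category for _start, _end, category in spans}
    unknown = present - set(CATEGORIES)
    if unknown:
        raise ValueError(f"unknown categories: {sorted(unknown)}")
    return [1 if category in present else 0 for category in CATEGORIES]


example_spans = [
    (0, 42, "Compassion"),
    (50, 97, "Metaphor"),
    (120, 160, "Compassion"),
]
example_vector = spans_to_multi_hot(example_spans)
```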

Experiments
We experiment with a number of different methods to provide baselines for further research in modeling PCL. We consider two settings: predicting the presence of PCL, viewed as a binary classification task (Task 1), and predicting PCL categories, viewed as a multi-label classification task (Task 2). We evaluate the following methods:
• SVM-WV. We use paragraph embeddings as the input for a Support Vector Machine implemented with SciKit-Learn. To create the paragraph embeddings, we use the average of the standard 300-dimensional Word2Vec Skip-gram word embeddings trained on the Google News corpus (Mikolov et al., 2013). For Task 1, the parameters that were selected after hyper-parameter tuning were C=10, gamma='scale', kernel='poly', while for Task 2 we found that C=100, gamma='scale', kernel='rbf' yielded the best results on the validation data.
• SVM-BoW. We use a TF-IDF weighted Bag-of-Words representation of the paragraphs as input to an SVM, also implemented with SciKit-Learn. In this case, the hyper-parameters that were selected are C=10, gamma='scale', kernel='rbf' for Task 1 and C=100, gamma='scale', kernel='linear' for Task 2.
• BiLSTM. We used a bidirectional LSTM, using the same Word2Vec embeddings as SVM-WV to represent the individual words. As hyper-parameters, we used 20 units for each LSTM layer and a dropout rate of 0.25% at both the LSTM and classification layers. We trained for 300 epochs, using the Adam optimizer, with early stopping and a patience of 10 epochs.
• Fine-tuned Language Models. We fine-tune a BERT language model (Devlin et al., 2019) for sequence classification. We considered two variants of this method, where we respectively used the BERT-large-cased and BERT-base-cased pre-trained models. To further explore the performance of language models, we also fine-tuned a RoBERTa-base (Liu et al., 2019) model, which can be viewed as an optimized version of BERT, and a DistilBERT (Sanh et al., 2019) model, which is a lighter and faster variant of BERT. In all cases, we trained the model for 10 epochs with a batch size of 32. For reproducibility, we fixed the random seeds at 1 in all cases.
• Random. To put the results in context, we include a classifier that relies on random guessing, choosing the positive class with 50% probability in Task 1, and independently selecting each label with a probability of 50% in Task 2.
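The paragraph representation used by SVM-WV, i.e. the average of the word embeddings of a paragraph, can be illustrated with a short dependency-free sketch. Several details below are assumptions made for illustration only: the toy 3-dimensional vectors stand in for the 300-dimensional Word2Vec embeddings, tokenization is done by whitespace, and out-of-vocabulary words are simply skipped (the text does not say how they were handled).

```python
def average_embedding(tokens, vectors, dim):
    """Average the embeddings of the tokens that have a vector.

    Tokens missing from `vectors` are skipped (an assumption); if no
    token is covered, a zero vector of length `dim` is returned.
    """
    total = [0.0] * dim
    found = 0
    for token in tokens:
        vector = vectors.get(token)
        if vector is None:
            continue
        for i, value in enumerate(vector):
            total[i] += value
        found += 1
    if found == 0:
        return total
    return [value / found for value in total]


# Toy 3-dimensional vectors standing in for pre-trained Word2Vec ones.
toy_vectors = {
    "poor": [1.0, 0.0, 2.0],
    "families": [3.0, 2.0, 0.0],
}
paragraph_embedding = average_embedding(
    "poor families need".split(), toy_vectors, dim=3
)
```

The resulting fixed-length vector is what a classifier such as an SVM would receive as input for each paragraph.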
For both Task 1 and Task 2 we used 10-fold cross validation for all the experiments. For the BiLSTM models, we used 10% of the training data in each fold as a validation set for early stopping. For the SVM models, we instead tuned the hyper-parameters using Grid Search Cross-Validation. As mentioned before, for Task 1 we view paragraphs labelled with 0 or 1 as negative examples, and the remaining paragraphs, labelled with 2, 3 or 4, as positive examples. The results are reported in terms of the precision, recall and F1 score of the positive class. Task 2 is viewed as a paragraph-level multi-label classification problem, where each paragraph is assigned a subset of the PCL category labels. Therefore, in these baselines, span boundaries are not used as part of the training data. We report the precision, recall and F1 score of each of the individual category labels.
Table 3: Results for the problem of detecting PCL, viewed as a binary classification problem (Task 1).
Table 4: Results for the problem of categorizing PCL, viewed as a paragraph-level multi-label classification problem (Task 2).
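To make the evaluation setup concrete, the following sketch computes the precision, recall and F1 score of the positive class, together with a random baseline that picks the positive class with 50% probability. It is a minimal illustration rather than the evaluation script used for the reported results: the function names are hypothetical, and returning 0.0 when a denominator is zero is an assumed convention.

```python
import random


def precision_recall_f1(gold, pred, positive=1):
    """Precision, recall and F1 of the `positive` class."""
    pairs = list(zip(gold, pred))
    tp = sum(g == positive and p == positive for g, p in pairs)
    fp = sum(g != positive and p == positive for g, p in pairs)
    fn = sum(g == positive and p != positive for g, p in pairs)
    # Assumed convention: a score is 0.0 when its denominator is zero.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def random_baseline(n_examples, seed=1):
    """Predict the positive class with 50% probability for each example."""
    rng = random.Random(seed)
    return [1 if rng.random() < 0.5 else 0 for _ in range(n_examples)]


gold = [1, 0, 1, 1, 0, 0]
pred = [1, 0, 0, 1, 1, 0]
precision, recall, f1 = precision_recall_f1(gold, pred)
```

For Task 2, the same function can be applied to one category at a time, using the corresponding column of the multi-hot gold and predicted vectors.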
The results of Task 1 are summarized in Table 3. As can be seen, all of the considered methods clearly outperform the random baseline. Unsurprisingly, the BERT-based methods achieve the best results, with RoBERTa performing slightly better than DistilBERT and BERT-base. The performance of BERT-large is surprisingly weak compared with the other BERT-based models, performing worse than the BiLSTM. This suggests that BERT-large is more prone to over-fitting, given the relatively small number of training examples. Table 4 shows the results we obtained in Task 2. RoBERTa outperforms the rest of the models in all the categories except for Authority voice, where BERT-large gets the best results. We can also notice the fairly good performance of the SVM methods. In some categories, such as Metaphor, the SVM-WV model performs almost on par with DistilBERT and BERT-base and outperforms the BiLSTM. For The poorer, the merrier, it outperforms all the models except for RoBERTa.
Comparing the results for different categories, we can see that Unbalanced power relations appear relatively easy to detect. This is not unexpected, given that words such as us, they, must or help are strong and common indicators of such language. For similar reasons, instances of Compassion appear relatively easy to detect. The poorer, the merrier is the least represented category in the entire dataset, with just 64 samples, which can explain the poor results for this category. However, the poor performance for the Metaphor category cannot be explained in this way.
Table 5: Examples of paragraphs from Task 1, with the predictions by RoBERTa (Pred.) and their gold labels (Gold).
Pred. | Paragraph | Gold
pos. | From his personal story and real-life encounters with poor families, manpower correspondent Toh Yong Chuan suggested shifting the focus from poor parents who repeatedly make bad decisions to their children (Lifting families Out of poverty: Focus on the children; last Thursday). | pos.
pos. | He said their efforts should not stop only at creating many graduates but also extended to students from poor Families so that they could break away from the cycle of poverty. | pos.
neg. | "The biggest challenge is the no work policy. I think that refugees who come here, or asylum seekers, they're unable to work and they have kids here - their kids are stateless. That's really the cause of a lot of stress in the community." | pos.
neg. | "The people of Khyber Pakhtunkhwa are resilient. I did not see hopelessness on any face," he said. | pos.
neg. | Teach kids to give back: When Kang runs summer camps with kids, she includes "Contribution Fridays" - the kids work together as a team to make sandwiches for the homeless and dole out the food in shelters. | pos.
pos. | These shocking failures will continue to happen unless the Government tackles the heart of the problem - the chronic underfunding of social care which is piling excruciating pressure on the NHS, leaving vulnerable patients without a lifeline. | neg.
pos. | Lilly-Hue: His ability to make sure our family is never in need - his sacrificial self. | neg.
pos. | Any Kenyan small-scale farmer with such an income could not be said to be hopelessly mired in agrarian destitution. But of course, nothing in life is ever so simple as to allow for neat and precise answers. | neg.
pos. | Selective kindness: In Europe, some refugees are more equal than others. | neg.
To get further insights into the dataset, Table 5 shows some examples of paragraphs from Task 1, their gold labels and the predictions by RoBERTa. There are three correctly classified instances and seven misclassified examples (i.e. three false negatives and four false positives). In many cases, we can see words and phrases that are often used in PCL, but which are not actually used in a condescending context, causing the model to predict false positives. For instance, in the seventh example, we find an excess of adjectives and flowery wording, e.g. shocking failures and excruciating pressure, which is common in PCL fragments from the Compassion category. In this example, however, such wording is used in a political context, without being condescending towards any particular group. In the fifth example, the model misclassifies the paragraph as not containing PCL. In this case, we have an example of the category The poorer, the merrier, which all models struggle to detect. Surprisingly, this category has the highest inter-annotator agreement in the annotation of the dataset. This suggests that, while it is very easy for human annotators to identify cases of this category, the models struggle to detect them. In Table 6, some incorrect predictions from Task 2 are presented. Among others, these examples illustrate how RoBERTa struggles to distinguish between presuppositions and authority voices, which are often incorrectly predicted together. Shallow solutions are also often missed by RoBERTa. A particularly clear case is the last example, where recognizing the presuppositions and shallow solutions in the text requires external knowledge of the situation and the needs of those affected. We can also see examples where the occurrence of a particular language structure appears to mislead RoBERTa, e.g. to open the doors wider for [...], in the fourth example, seems to lead the model to predict a shallow solution. Metaphors, as in this same example, are also difficult for RoBERTa to identify in this context.
Table 6: Examples of incorrect predictions by RoBERTa from Task 2.
For now the families are staying with friends and family. During the day they clean up the debris left by the fire, hoping that someone will come to their rescue. They received emergency relief packs, but they are still in need of clothes, beds, blankets and kitchen appliances.

Conclusions and Future Work
We have introduced the Don't Patronize Me! dataset, which is aimed at introducing the NLP community to the challenge of identifying and categorizing Patronizing and Condescending Language (PCL) towards vulnerable communities. As another contribution of this paper, we also introduced a two-level taxonomy of PCL categories, which was used for annotating the dataset. Our exploratory analysis shows that identifying condescending or patronizing texts is a difficult challenge, both for human judges and for NLP systems. Apart from the subtle and subjective nature of PCL, a particular challenge comes from the fact that accurately modelling such language often requires knowledge of the world and common sense (e.g. to assess whether a proposed solution is shallow, or whether a particular presupposition is warranted). Nonetheless, we found that both identifying PCL (Task 1) and categorizing occurrences of PCL (Task 2) are feasible, in the sense that non-trivial results can be achieved, with BERT-based approaches outperforming simpler methods. Future work will include the development of new models for both detecting and categorizing PCL. In addition, we plan to continue to extend the Don't Patronize Me! dataset with more paragraphs from news stories, as well as text fragments from different sources, such as social media or NGO campaigns, to create a useful and up-to-date resource for the community.