Socially Responsible and Explainable Hate Speech Detection in Brazilian Portuguese

Language is often used to discriminate against, attack, and terrorize people. Stereotypes and prejudices are also communicated through language and potentially perpetuated at scale on the web and social media. Although the study of hateful communication is urgent and relevant, there is a significant lack of research on explainable hate speech detection and on social bias in hate speech technologies. To fill these research gaps, this project investigates and provides socially responsible and explainable methods and data resources for hate speech detection in Brazilian Portuguese.

Our results include a novel optimized bag-of-words machine learning model with an interpretable input representation and a contextual lexicon for explainable hate speech detection. The method incorporates explicit and implicit pejorative terms from a specialized lexicon annotated with contextual information. It outperformed the baselines from the literature and is the current state of the art for Portuguese.

In addition, we created the first large-scale, expert-annotated corpus with explanations for Brazilian hate speech detection, as well as a specialized multilingual offensive lexicon. The HateBR/HateBRXplain corpus was collected from the comment sections of Brazilian politicians' Instagram accounts and manually annotated by experts. It comprises 7,000 documents annotated in three layers: a binary classification (offensive versus non-offensive comments), an offensiveness-level classification (highly, moderately, and slightly offensive), and nine hate speech groups (xenophobia, racism, homophobia, sexism, religious intolerance, partyism, apology for dictatorship, antisemitism, and fatphobia). Offensive comments also include human-annotated rationales.

We also proposed a new specialized offensive lexicon, MOL (Multilingual Offensive Lexicon). Its entries were manually identified by a linguist in the HateBR corpus. It holds 1,000 explicit and implicit pejorative terms and expressions, each annotated with contextual information (context-dependent or context-independent). Both the corpus and the lexicon were annotated by three different experts and achieved high inter-annotator agreement.

We also developed the first web system for offensiveness analysis in Brazilian Portuguese. The NoHateBrazil web system analyzes fine-grained offensiveness (highly, moderately, and slightly offensive) and shows the user a new measure of the reliability of the machine learning prediction.
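As a rough illustration of the contextual-lexicon bag-of-words idea described above, the following Python sketch counts tokens and adds one feature per contextual label. The lexicon entries, the feature names, and the `featurize` helper are hypothetical placeholders, not MOL entries and not the project's implementation.

```python
from collections import Counter

# Hypothetical stand-in for a contextual lexicon: term -> contextual label.
# These entries are illustrative placeholders, not taken from MOL.
TOY_LEXICON = {
    "idiot": "context-independent",
    "trash": "context-dependent",
}
LABELS = ("context-independent", "context-dependent")


def featurize(text, lexicon=TOY_LEXICON):
    """Return bag-of-words counts plus one count per contextual label."""
    counts = Counter(text.lower().split())
    features = {f"bow={token}": n for token, n in counts.items()}
    for label in LABELS:
        # Sum the occurrences of every lexicon term carrying this label.
        features[f"lex={label}"] = sum(
            n for token, n in counts.items() if lexicon.get(token) == label
        )
    return features


print(featurize("that plan is trash and he is an idiot"))
```

Because every feature is a named count, a linear classifier trained on such features could be inspected term by term, which is one assumed way a representation like this can support explanations.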
Finally, we proposed two new explainable methods. Social Stereotype Analysis (SSA) assesses discriminatory social bias in machine learning-based hate speech classifiers: it examines whether a classifier reflects social stereotypes by contrasting stereotypical beliefs with counter-stereotypes. Supervised Rational Attention (SRA) is a self-explaining method for hate speech detection that aligns attention mechanisms with human-annotated rationales.
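One common way to align attention with human rationales is to add an auxiliary loss that penalizes attention mass placed outside the annotated tokens. The sketch below shows that idea in plain Python. The cross-entropy form, the weight `lam`, and all function names are assumptions for illustration, since the exact SRA objective is not stated on this page.

```python
import math


def rationale_target(mask):
    """Turn a 0/1 rationale mask into a distribution over tokens."""
    total = sum(mask)
    return [m / total for m in mask] if total else None


def alignment_loss(attention, mask, eps=1e-12):
    """Cross-entropy between the rationale distribution and attention weights.

    Returns 0.0 when the comment has no annotated rationale.
    """
    target = rationale_target(mask)
    if target is None:
        return 0.0
    return -sum(t * math.log(a + eps) for t, a in zip(target, attention))


def total_loss(classification_loss, attention, mask, lam=1.0):
    """Assumed combined objective: task loss plus weighted alignment term."""
    return classification_loss + lam * alignment_loss(attention, mask)


mask = [0, 1, 1, 0]                  # annotator marked tokens 1 and 2
focused = [0.05, 0.45, 0.45, 0.05]   # attention mostly on the rationale
diffuse = [0.45, 0.05, 0.05, 0.45]   # attention mostly elsewhere
print(alignment_loss(focused, mask) < alignment_loss(diffuse, mask))
```

In this sketch, comments without rationales contribute only the classification loss, which matches the corpus design in which only offensive comments carry rationales.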

Principal Investigator

Francielle Vargas. University of São Paulo, Brazil

Researchers

Wolfgang Schmeisser. University of Barcelona, Spain

Ali Hürriyetoğlu. Royal Netherlands Academy of Arts and Sciences, Netherlands

Fabiana Góes. Rosalind Franklin Institute, UK

Isabelle Carvalho. University of São Paulo, Brazil

Isadora Salles. Federal University of Minas Gerais, Brazil

Fabrício Benevenuto. Federal University of Minas Gerais, Brazil

Publications

Aligning Attention with Human Rationales for Self-Explaining Hate Speech Detection
Brage Eilertsen, Røskva Bjørgfinsdóttir, Francielle Vargas, Ali Ramezani-Kebrya
40th Annual AAAI Conference on Artificial Intelligence (AAAI’26). Alignment Track. pp. 01–15. Singapore, Singapore.

HateBRXplain: A Benchmark Dataset with Human-Annotated Rationales for Explainable Hate Speech Detection in Brazilian Portuguese
Isadora Salles, Francielle Vargas, Fabrício Benevenuto
31st International Conference on Computational Linguistics (COLING 2025). Abu Dhabi, UAE. pp. 6659–6669. 2025.

Context-Aware and Expert Data Resources for Brazilian Portuguese Hate Speech Detection
Francielle Vargas, Isabelle Carvalho, Thiago A.S. Pardo, Fabrício Benevenuto
Natural Language Processing Journal. Cambridge University Press. pp. 1–22.

Socially Responsible Hate Speech Detection: Can Classifiers Reflect Social Stereotypes?
Francielle Vargas, Isabelle Carvalho, Ali Hürriyetoğlu, Thiago A.S. Pardo, Fabrício Benevenuto
Recent Advances in Natural Language Processing (RANLP 2023). pp. 1187–1196. Varna, Bulgaria.

NoHateBrazil: A Brazilian Portuguese Text Offensiveness Analysis System
Francielle Vargas, Isabelle Carvalho, Wolfgang Schmeisser-Nieto, Fabrício Benevenuto, Thiago A.S. Pardo
Recent Advances in Natural Language Processing (RANLP 2023). pp. 1180–1186. Varna, Bulgaria.

HateBR: A Large Expert Annotated Corpus of Brazilian Instagram Comments for Offensive Language and Hate Speech Detection
Francielle Vargas, Isabelle Carvalho, Fabiana R. Góes, Thiago A.S. Pardo, Fabrício Benevenuto
13th Conference on Language Resources and Evaluation (LREC 2022). pp. 7174–7183. Marseille, France.

Contextual-Lexicon Approach for Abusive Language Detection
Francielle Vargas, Fabiana R. Góes, Isabelle Carvalho, Fabrício Benevenuto, Thiago A.S. Pardo
Recent Advances in Natural Language Processing (RANLP 2021). pp. 1442–1451. Held Online.

Resources

Computational Methods

SRA: Supervised Rational Attention for Self-Explaining Hate Speech Detection
B+M: Contextual BoW with Interpretable Input Representation in Hate Speech Detection
SSA: Post-Hoc Counter-Stereotype Explanations for Bias Assessment in Hate Speech Classifiers
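
To give a sense of how a counter-stereotype contrast like the one behind SSA might be scored, the sketch below compares a classifier's output on paired inputs. The `stereotype_gap` function, the pair format, and the toy scorers are hypothetical. How SSA actually builds and evaluates its pairs is described in the RANLP 2023 paper, not here.

```python
def stereotype_gap(score, pairs):
    """Mean difference in offensiveness score between each stereotype
    sentence and its counter-stereotype counterpart.

    `score` is any callable mapping text to a number in [0, 1].
    """
    gaps = [score(stereotype) - score(counter) for stereotype, counter in pairs]
    return sum(gaps) / len(gaps)


# Placeholder pairs that differ only in the group mentioned (assumed format).
PAIRS = [
    ("group_a people are bad at this", "group_b people are bad at this"),
    ("nobody trusts group_a", "nobody trusts group_b"),
]


def biased_score(text):
    # Toy scorer that reacts to the group token alone (an assumed failure mode).
    return 0.9 if "group_a" in text.split() else 0.2


def neutral_score(text):
    # Toy scorer that ignores which group is mentioned.
    return 0.5


print(stereotype_gap(biased_score, PAIRS))
print(stereotype_gap(neutral_score, PAIRS))
```

A gap far from zero suggests that the prediction changes with the group mentioned rather than with the content of the sentence.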

Benchmark Datasets

HateBR and HateBRXplain: An Expert Benchmark Dataset for Brazilian Portuguese Hate Speech Detection with Human-Annotated Rationales
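
The three annotation layers and the rationales could be represented with a record like the one below. This is a minimal sketch: the class name, field names, and label strings are assumptions for illustration and do not reflect the released file format.

```python
from dataclasses import dataclass
from typing import Optional

LEVELS = ("slightly", "moderately", "highly")
GROUPS = frozenset({
    "xenophobia", "racism", "homophobia", "sexism", "religious_intolerance",
    "partyism", "apology_for_dictatorship", "antisemitism", "fatphobia",
})


@dataclass(frozen=True)
class AnnotatedComment:
    text: str
    offensive: bool                      # layer 1: binary label
    level: Optional[str] = None          # layer 2: offensiveness level
    hate_groups: frozenset = frozenset() # layer 3: zero or more groups
    rationales: tuple = ()               # spans marked by annotators

    def __post_init__(self):
        if self.offensive:
            if self.level not in LEVELS:
                raise ValueError("offensive comments need a valid level")
        elif self.level is not None or self.hate_groups or self.rationales:
            raise ValueError("non-offensive comments carry no further labels")
        unknown = set(self.hate_groups) - GROUPS
        if unknown:
            raise ValueError(f"unknown hate speech groups: {sorted(unknown)}")


example = AnnotatedComment(
    text="placeholder comment",
    offensive=True,
    level="moderately",
    hate_groups=frozenset({"partyism"}),
    rationales=("placeholder",),
)
print(example.level, sorted(example.hate_groups))
```

The validation encodes one reading of the annotation scheme, in which the level, groups, and rationales apply only to offensive comments.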

Lexicon

MOL: Multilingual offensive lexicon annotated with contextual information.

Software

NoHateBrazil: A Brazilian Portuguese offensive comments analysis system.