LiLaH (The Linguistic Landscape of Hate Speech in Social Media)

LiLaH (The Linguistic Landscape of Hate Speech in Social Media) is a project funded by FWO (Research Foundation Flanders) and ARIS (Slovenian Research and Innovation Agency) that builds systems to automatically recognize and analyse hate speech in social media texts. We study the linguistic properties of the language used to express hate on social media, specifically hate against migrants and LGBT people, and how to detect it automatically. The languages addressed are English, Dutch, Slovene, Croatian and French.

The project is a cooperation between the Centre for Computational Linguistics and Psycholinguistics (CLiPS) (University of Antwerp, Belgium), the Department of Translation (University of Ljubljana, Slovenia) and the Department of Knowledge Technologies (Jozef Stefan Institute, Slovenia).

People

Ljubljana Team

Darja Fišer (Department of Translation, University of Ljubljana)
Tomaž Erjavec (Department of Knowledge Technologies, Jozef Stefan Institute)
Nikola Ljubešić (Department of Knowledge Technologies, Jozef Stefan Institute)
Kristina Pahor de Maiti (Department of Knowledge Technologies, Jozef Stefan Institute)
Jasmin Franza (Department of Knowledge Technologies, Jozef Stefan Institute)

Antwerp Team

Walter Daelemans (CLiPS, University of Antwerp)
Ilia Markov (CLiPS, University of Antwerp)
Tom De Smedt (CLiPS, University of Antwerp and Textgain)

Nominations & awards

Tom De Smedt, 2019: nomination, monitoring online extremism, Research Grant, Auschwitz Foundation (1st)

Acknowledgments

The project is funded under grants ARRS N6-0099 and FWO G070619N: "The linguistic landscape of hate speech on social media", 2019–2023

Publications

N. Yuzbashyan, N. Banar, I. Markov, W. Daelemans (2023). An Exploration of Zero-Shot Natural Language Inference-Based Hate Speech Detection. In: Third Workshop on Language Technology for Equality, Diversity and Inclusion (LT-EDI 2023), Varna, Bulgaria, ACL, pp. 1–9

L. Hilte, I. Markov, N. Ljubešić, D. Fišer, W. Daelemans (2023). Who are the Haters? A Corpus-Based Demographic Analysis of Authors of Hate Speech. Frontiers in Artificial Intelligence, vol. 6

J. Lemmens, I. Markov, W. Daelemans (2023). The LiLaH Emotion Lexicon of Greek, Kurdish, Turkish, Spanish, Farsi and Chinese. CLiPS Technical Report Series, CTRS-009

I. Gevers, I. Markov, W. Daelemans (2022). Linguistic Analysis of Toxic Language on Social Media. Computational Linguistics in the Netherlands Journal, vol. 12, pp. 33–48

I. Markov, W. Daelemans (2022). The Role of Context in Detecting the Target of Hate Speech. In: Third Workshop on Threat, Aggression and Cyberbullying (TRAC 2022), Gyeongju, Republic of Korea, ACL, pp. 37–42

I. Markov, I. Gevers, W. Daelemans (2022). An Ensemble Approach for Dutch Cross-Domain Hate Speech Detection. In: 27th International Conference on Natural Language & Information Systems (NLDB 2022), Valencia, Spain. LNCS, Springer, vol. 13286, pp. 3–15

B. Evkoski, A. Pelicon, I. Mozetič, N. Ljubešić, P. K. Novak (2021). Retweet communities reveal the main sources of hate speech. arXiv preprint arXiv:2105.14898

B. Evkoski, I. Mozetič, N. Ljubešić, P. K. Novak (2021). Community evolution in retweet networks. PLOS ONE 16(9): e0256175

N. Ljubešić, D. Lauc (2021). BERTić - The Transformer Language Model for Bosnian, Croatian, Montenegrin and Serbian. arXiv preprint arXiv:2104.09243

Y. Scherrer, N. Ljubešić (2021). Social Media Variety Geolocation with geoBERT. In: Eighth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2021), Kyiv, Ukraine, ACL, pp. 135–140

J. Lemmens, I. Markov, W. Daelemans (2021). Improving Hate Speech Type and Target Detection with Hateful Metaphor Features. In: Fourth Workshop on NLP for Internet Freedom: Censorship, Disinformation, and Propaganda (NLP4IF 2021), Online, ACL, pp. 7–16

I. Markov, W. Daelemans (2021). Improving Cross-Domain Hate Speech Detection by Reducing the False Positive Rate. In: Fourth Workshop on NLP for Internet Freedom: Censorship, Disinformation, and Propaganda (NLP4IF 2021), Online, ACL, pp. 17–22

I. Markov, N. Ljubešić, D. Fišer, W. Daelemans (2021). Exploring Stylometric and Emotion-Based Features for Multilingual Cross-Domain Hate Speech Detection. In: Eleventh Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis (WASSA 2021), Online, ACL, pp. 149–159

Z. Fijavž, D. Fišer (2020). Corpus-assisted analysis of water flow metaphors in Slovene online news migration discourse of 2015. In: The Dark Side of Digital Platforms: Linguistic Investigations of socially unacceptable online discourse practices, pp. 56–84

V. Gorjanc, D. Fišer (2020). Twitter discourse on LGBTQ+ in Slovenia. In: The Dark Side of Digital Platforms: Linguistic Investigations of socially unacceptable online discourse practices, pp. 36–55

K. Pahor de Maiti, D. Fišer, N. Ljubešić (2020). Nonstandard linguistic features of Slovene socially unacceptable discourse on Facebook. In: The Dark Side of Digital Platforms: Linguistic Investigations of socially unacceptable online discourse practices, pp. 12–34

K. Pahor de Maiti, D. Fišer, N. Ljubešić, T. Erjavec (2020). Grammatical Footprint of Socially Unacceptable Facebook Comments. In: Conference on Language Technologies & Digital Humanities, Ljubljana: Inštitut za novejšo zgodovino, pp. 48–57

K. Pahor de Maiti, D. Fišer, N. Ljubešić, T. Erjavec (2020). Analiza kazalnih zaimkov v družbeno nesprejemljivih spletnih komentarjih. In: Slovenščina-diskurzi, zvrsti in jeziki med identiteto in funkcijo, pp. 89–99

N. Ljubešić, I. Markov, D. Fišer, W. Daelemans (2020). The LiLaH Emotion Lexicon of Croatian, Dutch and Slovene. In: Third Workshop on Computational Modeling of People's Opinions, Personality, and Emotions in Social Media (PEOPLES 2020), Barcelona, Spain (Online), ACL, pp. 153–157

D. Fišer, M. K. Golob (2019). Corporate Communication on Twitter in Slovenia. Contributions to Contemporary History, 59(1), pp. 46–69

N. Ljubešić, K. Dobrovoljc (2019). What does Neural Bring? Analysing Improvements in Morphosyntactic Annotation and Lemmatisation of Slovenian, Croatian and Serbian. In: 7th Workshop on Balto-Slavic Natural Language Processing (BSNLP 2019), Florence, Italy, ACL, pp. 29–34

J. Franza, D. Fišer (2019). The lexical inventory of Slovene socially unacceptable discourse on Facebook. In: 7th Conference on Computer-Mediated Communication (CMC) and Social Media Corpora (CMC-Corpora 2019), Cergy-Pontoise, France, pp. 43–47

K. Pahor de Maiti, D. Fišer, N. Ljubešić (2019). How haters write: analysis of nonstandard language in online hate speech. In: 7th Conference on Computer-Mediated Communication (CMC) and Social Media Corpora (CMC-Corpora 2019), Cergy-Pontoise, France, pp. 37–42

N. Ljubešić, D. Fišer, T. Erjavec (2019). The FRENK Datasets of Socially Unacceptable Discourse in Slovene and English. In: International Conference on Text, Speech, and Dialogue (TSD 2019), Ljubljana, Slovenia, pp. 103–114

Tools

N. Ljubešić (2019). The CLASSLA-StanfordNLP model for UD dependency parsing of standard Slovenian. Slovenian language resource repository CLARIN.SI

https://huggingface.co/classla/roberta-base-frenk-hate: text classification model based on roberta-base and fine-tuned on the FRENK dataset

https://huggingface.co/classla/sloberta-frenk-hate: text classification model based on EMBEDDIA/sloberta and fine-tuned on the FRENK dataset

https://huggingface.co/classla/bcms-bertic-frenk-hate: text classification model based on classla/bcms-bertic and fine-tuned on the FRENK dataset
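
The three classifiers listed above can be loaded through the Hugging Face `transformers` library. The following is a minimal sketch, not the project's own code: the model identifiers come from the list above, while the language codes (`en`, `sl`, `hr`), the helper names `model_for` and `classify`, and the use of the generic `text-classification` pipeline are assumptions made for illustration. The label names each model returns are not stated here and should be checked on the respective model page.

```python
# Model identifiers are taken from the Tools list; the language keys are an
# assumption inferred from each model's base checkpoint.
FRENK_MODELS = {
    "en": "classla/roberta-base-frenk-hate",   # based on roberta-base
    "sl": "classla/sloberta-frenk-hate",       # based on EMBEDDIA/sloberta
    "hr": "classla/bcms-bertic-frenk-hate",    # based on classla/bcms-bertic
}


def model_for(lang: str) -> str:
    """Return the Hugging Face model ID listed for a language code."""
    try:
        return FRENK_MODELS[lang]
    except KeyError:
        raise ValueError(f"no FRENK model listed for language {lang!r}") from None


def classify(texts, lang="en"):
    """Classify texts with the FRENK model chosen for `lang`.

    Requires the `transformers` package (plus a backend such as PyTorch)
    and downloads the model from the Hugging Face Hub on first use.
    Returns a list of dicts with "label" and "score" keys.
    """
    from transformers import pipeline  # imported lazily: heavy dependency

    classifier = pipeline("text-classification", model=model_for(lang))
    return classifier(list(texts))
```

With these hypothetical helpers, `classify(["This is an example comment."], lang="en")` would return one prediction per input text, each with a predicted label and a confidence score.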

Resources

J. Lemmens, I. Markov, W. Daelemans (2023). The LiLaH Emotion Lexicon of Greek, Kurdish, Turkish, Spanish, Farsi and Chinese. Instituut voor de Nederlandse taal / taalmaterialen

I. Markov, L. Hilte, N. Ljubešić, D. Fišer, W. Daelemans (2022). Facebook metadata dataset LiLaH-HAG. Slovenian language resource repository CLARIN.SI

B. Evkoski, A. Pelicon, I. Mozetič, N. Ljubešić, P. K. Novak (2021). Slovenian Twitter dataset 2018-2020 1.0. Slovenian language resource repository CLARIN.SI

N. Ljubešić, D. Fišer, D. Kranjčić (2021). Offensive language dataset of Croatian, English and Slovenian comments FRENK 1.0. Slovenian language resource repository CLARIN.SI

W. Daelemans, D. Fišer, J. Franza, D. Kranjčić, J. Lemmens, N. Ljubešić, I. Markov, D. Popič (2020). The LiLaH Emotion Lexicon of Croatian, Dutch and Slovene. Slovenian language resource repository CLARIN.SI

Related work

E. Kotzé, B. A. Senekal, W. Daelemans (2020). Automatic classification of social media reports on violent incidents in South Africa using machine learning. South African Journal of Science. Vol. 116, No. 3–4, pp. 1–8

E. Kotzé, B. A. Senekal, W. Daelemans (2020). Exploring the Classification of Security Events using Sparse and Dense Representation of Text. International SAUPEC/RobMech/PRASA Conference, Cape Town, South Africa, pp. 1–6

S. Jaki, T. De Smedt, M. Gwóźdź, R. Panchal, A. Rossa, G. De Pauw (2019). Online hatred of women in the Incels.me forum: Linguistic analysis and automatic detection. Journal of Language Aggression and Conflict, Vol. 7, No. 2, pp. 240–268

S. Jaki, T. De Smedt (2018). Right-wing German Hate Speech on Twitter: Analysis and Automatic Detection. arXiv preprint arXiv:1910.07518

T. De Smedt, S. Jaki, E. Kotzé, L. Saoud, M. Gwóźdź, G. De Pauw, W. Daelemans (2018). Multilingual Cross-domain Perspectives on Online Hate Speech

P. Fortuna, S. Nunes (2018). A survey of automatic detection of hate speech in text. ACM Computing Surveys, Vol. 51, No. 4, article 85

Funders

FWO
ARIS