Toxic speech has become an existential challenge for social media platforms, compelling them to invest heavily in the monitoring and removal of hateful content. But what does toxicity really mean? And can we successfully train machines to detect it? In conversation with Nicholas Carlisle, Dr Gareth Tyson, a leading researcher on the dark side of the Internet, tells us how researchers are tackling online toxicity through the power of computer science.

Nicholas Carlisle:  The subject you’re studying affects so many people across the world. They might not use the word toxicity to describe it, but they have likely experienced it. Can you help our readers understand how researchers define online toxicity?

Dr Gareth Tyson: I should emphasize that I’m coming at this question from a computer science perspective, so our definitions of toxicity are embedded in data. Most of the research in this area is looking at ways to detect toxicity in an automated fashion. We take a large amount of data; in the case of Twitter, for example, this would be lots of tweets. We then ask people to annotate the data by telling us whether or not they think a particular post is toxic or non-toxic. Once we have that, we essentially try to build machine brains that look at that data and learn its patterns. Then, when given new data, they can guess whether the humans would have annotated it as toxic or otherwise.
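As a rough illustration of that annotate-then-train loop, here is a minimal sketch in Python using scikit-learn; the posts and labels are invented stand-ins for real human-annotated data rather than anything from Dr Tyson’s studies.

```python
# Minimal sketch: learn from human-annotated posts, then guess labels for new ones.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Posts paired with (hypothetical) human annotations: 1 = toxic, 0 = non-toxic.
posts = [
    "you are a complete idiot",
    "thanks for sharing, really interesting thread",
    "nobody wants you here, just leave",
    "congratulations on the new job!",
]
labels = [1, 0, 1, 0]

# The "machine brain": turn text into features, then learn the label patterns.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(posts, labels)

# Given new, unseen posts, guess how an annotator would have labelled them.
new_posts = ["what a pathetic excuse for a person", "lovely photo, thanks"]
print(model.predict(new_posts))              # predicted labels (1 = toxic, 0 = non-toxic)
print(model.predict_proba(new_posts)[:, 1])  # estimated probability of the 'toxic' class
```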

NC: And how are they annotating the data? Are there guidelines?

GT: Typically the research leads will draw up a definition of what they consider toxic. The one that’s probably most widely used was pioneered by Google Jigsaw’s Perspective project. Their definition of toxicity is essentially rude, impolite or disrespectful commentary which – and this is the important bit – is likely to make somebody leave a discussion.

NC:  So do people have to actually leave a platform before speech is considered toxic?

GT: At that stage the machine algorithms are only estimating the likelihood that someone would leave, so you couldn’t know for certain. The way it’s often quantified is as a number between zero and one. If the language is incredibly toxic and aggressive, its value will be closer to 1. If it’s less aggressive, it’s closer to 0.

But these values can hide a lot of complexity. This was particularly a problem in the early research in the field. When you give ten people the same post and ask them to classify it, often you’ll have disagreements. Some people will say that it’s hateful, and others might say it’s just a joke.

Since that early work, people have tried to break toxicity down into much more specific definitions, moving away from those generic concepts of hate or toxicity. For instance, there has been work looking at Islamophobia, anti-Semitism and misogyny, and within those categories researchers have tried to give much more formal definitions of toxicity. Even within these categories there are different forms of toxic speech: for example, broadly misogynistic language is different from targeted misogynistic threats. So you get the annotators to look through these guidelines before they label the data. It’s very important that you get multiple people to do this, so that if there’s disagreement, you can consolidate those results into a single output.
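As an illustration of that consolidation step, a simple majority vote over hypothetical annotator judgements might look like the following; real studies usually also report inter-annotator agreement statistics alongside the final labels.

```python
# Sketch: consolidate several annotators' judgements on each post by majority vote.
from collections import Counter

# Hypothetical annotations: three annotators per post.
annotations = {
    "post_1": ["toxic", "toxic", "not_toxic"],        # disagreement, resolved 2-1
    "post_2": ["not_toxic", "not_toxic", "not_toxic"],
    "post_3": ["toxic", "not_toxic", "not_toxic"],
}

consolidated = {}
for post_id, votes in annotations.items():
    label, _count = Counter(votes).most_common(1)[0]
    consolidated[post_id] = label

print(consolidated)
# {'post_1': 'toxic', 'post_2': 'not_toxic', 'post_3': 'not_toxic'}
```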

NC: I’m curious to know if any of these machine models are having success? And how would you define success?

GT: In the computer science world, success is always quantified. There are several straightforward statistics that could be used to quantify it: accuracy is the most popular. That’s basically the fraction of the machine model’s guesses that were correct.

However, what tends to happen is that there’s implicit bias baked into the data collection process. For instance, on the day you collect your data there may have been a specific hate attack in London, leading all of the discussions online to be very narrow and focused. If you train the model on this data, its high accuracy score won’t mean much. It will learn very easily from narrow data sets like this, but of course this learning will not be generalizable because it was specific to one event.

It also obviously wouldn’t make sense to test your model on the same data you trained it on. Instead, we might collect 100 tweets and use 80 of those to train the model. Then we take the 20 tweets that the computer hasn’t seen, and test it on those.
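Mechanically, that 80/20 split and the accuracy figure mentioned earlier come down to a few lines of scikit-learn; the data below is a placeholder for real annotated tweets.

```python
# Sketch: hold back 20% of the annotated data for testing, then measure accuracy.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Placeholders standing in for 100 human-annotated tweets and their labels.
posts = [f"example tweet number {i}" for i in range(100)]
labels = [i % 2 for i in range(100)]  # 1 = toxic, 0 = non-toxic

# 80 tweets to train on, 20 the model never sees during training.
train_posts, test_posts, train_labels, test_labels = train_test_split(
    posts, labels, test_size=0.2, random_state=42
)

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(train_posts, train_labels)

# Accuracy: the fraction of guesses on the unseen tweets that match the annotations.
print(accuracy_score(test_labels, model.predict(test_posts)))
```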

NC: Can you share with us any great success story in terms of creating a model that works?

GT: In terms of success stories, most of the hate speech or toxic content models work relatively well for generic content. This is because there are certain traits in language that are very often associated with hateful or toxic behavior regardless of the topic. So in many cases, the hate speech classifiers that we’re building can accurately classify around three quarters of the content you put in. The problem is that there’s a lot of content at the fringe which doesn’t fall into that mainstream category. For example, emerging fringe communities like the Incels might introduce new terminology that in other contexts means nothing. Often the models can’t keep up with that.

But companies such as Google and Facebook have massive pools of data and are building pre-trained models in a way that academic researchers could never dream of. We then use the knowledge they’ve baked into those models to build our own toxic content classifiers. With this comes a much more robust understanding of semantics, and we can get much better performance.
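One common way of building on such pre-trained models is to fine-tune them on annotated toxicity data. The sketch below uses the Hugging Face transformers library with the publicly available bert-base-uncased model; it is a generic illustration of the approach rather than the specific setup Dr Tyson’s group uses, and the toy posts and labels are invented.

```python
# Sketch: fine-tune a pre-trained language model as a binary toxicity classifier.
import torch
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Toy annotated data; a real setup would use thousands of labelled posts.
posts = ["you are a complete idiot", "lovely photo, thanks"]
labels = [1, 0]  # 1 = toxic, 0 = non-toxic

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

class ToxicityDataset(torch.utils.data.Dataset):
    """Wraps tokenized posts and labels in the format the Trainer expects."""
    def __init__(self, texts, labels):
        self.encodings = tokenizer(texts, truncation=True, padding=True)
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="toxicity-model", num_train_epochs=1),
    train_dataset=ToxicityDataset(posts, labels),
)
trainer.train()  # starts from BERT's general language knowledge, not from scratch
```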

NC: What do you do when content is being spread as memes, not as words?

GT: There’s been a huge increase in the use of memes and multimodal hate, where people aren’t just disseminating by text, but using images as well. So we were looking at where memes come from and how they spread, and what we found is that an inordinate proportion of them come from these fringe communities. The memes are then dropped in a coordinated fashion onto other platforms. It becomes way more difficult to detect and take down this type of toxic content, because of course you’ve got two modalities to process: image and text. Importantly, the image and the text can’t really be processed in isolation.

This makes memes and other forms of image-based toxicity much more difficult to detect. So we’ve been really struggling to build robust classifiers that generalize well across all sorts of memes. In many cases, companies like Facebook are now relying on techniques such as image hashing, which help them spot variants of similar images. Once you’ve manually identified an image as ‘bad’, it then becomes much easier to remove its variants.
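Perceptual image hashing of the kind described here can be sketched in a few lines using the Pillow and imagehash libraries; the file names and the distance threshold below are purely illustrative.

```python
# Sketch: flag likely variants of an already-identified 'bad' image via perceptual hashing.
from PIL import Image
import imagehash

# Hash of an image that moderators have already identified as a known bad meme.
known_bad_hash = imagehash.phash(Image.open("known_bad_meme.png"))

# A newly uploaded image, e.g. a re-posted copy with small edits or recompression.
candidate_hash = imagehash.phash(Image.open("uploaded_image.png"))

# Perceptual hashes of visually similar images differ in only a few bits,
# so a small Hamming distance suggests the upload is a variant of the known image.
distance = known_bad_hash - candidate_hash
if distance <= 8:  # threshold chosen purely for illustration
    print(f"Likely variant of a known bad image (distance={distance})")
```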

NC: When we think about the unique type of toxicity that is found online, how does it get spread and disseminated?

GT: Historically we’ve always studied it in a uni-platform fashion, so we go to a platform like Twitter and we collect lots of data. We identify what’s toxic, and then we look at who retweets it. The most interesting thing is that often toxic content spreads faster and further than non-toxic content. More recently, researchers have been looking into how toxicity spreads in a multi-platform way, through ‘raids’ and ‘coordinated actions’. In the case of raids, what happens is coordinated subsets of the population orchestrate attacks on particular communities.

A really powerful example of this is 4Chan, which is a relatively well-known fringe platform. It works in a simple way. You have anonymous posts and there’s a subculture which by and large is themed around various types of hate. As part of this, people on 4Chan coordinate their user base to migrate onto other platforms and do something disruptive. For example, if you have a YouTube video and the person speaking on the video is disagreeing with the 4Chan users, somebody on 4Chan will post a link to that YouTube video and say the phrase “you know what to do”. Then all of a sudden a huge number of users will migrate from 4Chan onto that YouTube page and start posting hateful and negative comments.

That can be distressing for the person who posted that video, particularly if it’s live. I actually have a friend and colleague who experienced this. He works on this topic and has done measurements in the 4Chan community. He was giving a presentation about the topic on YouTube, and when the community spotted this they coordinated a raid against him, and the comments started to fill with abuse.

NC: When we last talked you mentioned that some researchers feel we should not shut down toxic speech, but should rely on counter speech instead. Can you tell me more about this?

GT: I should flag that it’s probably not a well-established or well-agreed strategy. Many diverse opinions exist in this space. One argument is for shutting down and de-platforming, so that’s where you remove somebody from the platform and prevent them from communicating. But what is the best counter strategy to employ against a given user or given coordinated attack? I would probably argue that in the majority of the cases it’s counter speech. The reason for that is because when users get suspended or banned from platforms they don’t just disappear off the face of the earth. They either take that resentment and bitterness into the physical world, or they migrate into a different space.

That person may simply create another account on the same platform. They start from scratch again. This is a concept referred to as whitewashing, where you suspend somebody, but the accounts are so easy to create that they just pop up again somewhere else.

The other worrying outcome of banning somebody is that they just get pushed further into the quagmire. Whereas they may have started voicing their initial concerns on Twitter, they now find themselves pushed into other platforms where they don’t get confronted with a slightly more diverse opinion base. They instead find themselves in an echo chamber where they’re preaching to the converted.

The outcome is that the rest of us don’t know what their viewpoint was anymore because it’s been pushed into the shadows. I personally feel that suspending sometimes has to be done, but it should be a last resort. Counter speech tends to do a better job because at the very least you know what’s going on, and in the best case you actually change the person’s opinion.

NC: From what you have seen in your research do you think participation in online spaces has the capacity to increase an individual’s level of hate?

GT: I think the sad answer to that question is yes, but in the same way that interaction between any humans has the capacity to increase hate. I don’t think it’s inherent to the fact that this is online. It’s just the fact that interactions in general can foster hate in the same way they can foster any other emotion.

The difference is that the online space has the ability to connect people in a way that’s never been possible before. For example, before the Internet, if someone had decided they were going to become anti-Semitic, it would have been very difficult for them to find a community. They wouldn’t have a clue where to go. But with the Internet they could probably enter a few search words into Google and immediately find others who share their feelings, and connecting with this new community would reinforce their views.

There’s also something called the online disinhibition effect. A lot of our inhibitions drop away when we’re online, so I might say and do things in a semi-anonymous online fashion that I would just never, ever say face to face.

NC: Quite a few people are pessimistic about the future of online spaces. They see the Internet as a gathering place that is getting increasingly degraded. I’ve seen comparisons made to the tragedy of the commons, in which individuals with access to a shared resource (e.g. shared meadows for grazing cattle, often referred to as “the commons”) act in their own interest and, in doing so, ultimately deplete the resource. What’s your prognosis?

GT: It’s hard to deny that there are major problems in many online spaces. But I think one thing to emphasize is that we are really in the nascent stages of this. Just 20 years ago we didn’t have Twitter or Instagram. What I suspect will happen is as the years pass and more of our physical world migrates into the online realms, we’ll start to get a better understanding and appreciation of the norms that we’ve evolved over thousands of years in the physical world.

And in parallel to that, I think there’s a lot of good education work going on. People being born today are going to grow up as digital natives, with their schooling themed around their lives online, particularly with the pandemic accelerating that shift. I just hope that people naturally migrate towards a more positive outlook, in the same way that over thousands of years we’ve evolved to interact in a more positive fashion. We don’t shout abuse at people traveling with us on the train, so why should we shout abuse at people on Twitter?

Dr Gareth Tyson is a senior lecturer in Computer Science at Queen Mary University of London. He is also a Fellow at the Alan Turing Institute, co-leads the Social Science Lab, and is Deputy Director of the Institute of Applied Data Science, among other roles. Together with his collaborators, Dr Tyson has received numerous accolades for his research, including Facebook’s Shared Task on Hateful Memes Prize in 2021.