The motivation is straightforward: I just feel sad when they sound like they are suffering from toxic chats. The goal is just as straightforward: design an automated system that spots toxic chats and quarantines them.
Unfortunately, the YouTube API does not offer a way to retrieve these kinds of events in real time, which is crucial because live streams are the only place where we can observe moderators' actions (deletions and bans). Once a stream is archived, these activities are no longer observable.
So I ended up developing a library that accumulates events from a YouTube live stream, along with a fancy CLI app that mimics live chat. It accepts a YouTube video ID and saves live chats in [JSON Lines](https://jsonlines.org/) format.
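To make the storage format concrete, here is a minimal sketch of writing and reading events as JSON Lines, where each line is one standalone JSON object. The event shapes and field names (`type`, `authorId`, `targetId`, `message`) are assumptions for illustration only, not the library's actual schema.

```python
import io
import json


def append_event(stream, event):
    """Write one event as a single JSON document followed by a newline."""
    stream.write(json.dumps(event, ensure_ascii=False) + "\n")


def read_events(stream):
    """Yield events back one line at a time, skipping blank lines."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


# Hypothetical event shapes; the real schema may differ.
events = [
    {"type": "chat", "authorId": "a1", "message": "hello"},
    {"type": "deletion", "targetId": "m42"},
    {"type": "ban", "authorId": "a9"},
]

# An in-memory buffer stands in for a file opened in append mode.
buffer = io.StringIO()
for event in events:
    append_event(buffer, event)

buffer.seek(0)
restored = list(read_events(buffer))
```

Because every line is independent, a collector can append events as they arrive and a later job can stream the file without loading it whole.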
Thankfully, there's a great web service in the Hololive community: [Holotools](https://hololive.jetri.co). It provides an API that gives us an index of past, ongoing, and upcoming live streams from Hololive talents.
Here I divided my system into two components: a scheduler and workers. The scheduler periodically checks the Holotools API for newly scheduled live streams and creates jobs to be handled by workers. Workers are responsible for handling jobs and spawning a process to collect live chat events.
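The scheduler and worker split can be sketched roughly as follows. This is a simplified, in-process illustration: the stream index is passed in as plain data instead of being fetched from the Holotools API, `fake_collect` is a hypothetical stand-in for spawning a collector process, and the `videoId` field name is an assumption.

```python
import queue
import threading


def schedule_new_streams(stream_index, seen_ids, jobs):
    """Enqueue one job per stream that has not been scheduled before."""
    created = 0
    for stream in stream_index:
        video_id = stream["videoId"]  # assumed field name
        if video_id not in seen_ids:
            seen_ids.add(video_id)
            jobs.put(video_id)
            created += 1
    return created


def worker(jobs, collect, results):
    """Handle jobs until a None sentinel arrives."""
    while True:
        video_id = jobs.get()
        if video_id is None:
            break
        results.append(collect(video_id))


def fake_collect(video_id):
    # Hypothetical stand-in for spawning a chat collector process.
    return f"collected:{video_id}"


jobs = queue.Queue()
seen = set()
results = []

thread = threading.Thread(target=worker, args=(jobs, fake_collect, results))
thread.start()

# Two polling rounds; "def" appears in both but is scheduled only once.
first_poll = [{"videoId": "abc"}, {"videoId": "def"}]
second_poll = [{"videoId": "def"}, {"videoId": "ghi"}]
n1 = schedule_new_streams(first_poll, seen, jobs)
n2 = schedule_new_streams(second_poll, seen, jobs)

jobs.put(None)  # tell the worker to stop
thread.join()
```

Keeping a set of already scheduled IDs lets the scheduler poll repeatedly without creating duplicate jobs for the same stream.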
Shannon entropy alone is not enough. So I combined the ideas of the [Burrows-Wheeler Transform](https://en.wikipedia.org/wiki/Burrows%E2%80%93Wheeler_transform) and [Run-length Encoding](https://en.wikipedia.org/wiki/Run-length_encoding) to formulate a new entropy that represents the "spamminess" of a given text.
Here's a [t-SNE](https://en.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding) visualization of the Sentence Transformer output. Blue dots are spam and orange dots are normal chats.
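One limitation of character-level Shannon entropy is that it ignores order, so a repeated pattern and a shuffle of the same characters score the same. The sketch below shows how a Burrows-Wheeler Transform followed by run-length encoding can tell them apart. The `run_ratio` score is a hypothetical stand-in and not necessarily the exact formula used here; the BWT is a naive sorted-rotations version that assumes the text contains no NUL character.

```python
import math
from collections import Counter


def shannon_entropy(text):
    """Character-level Shannon entropy in bits per character."""
    counts = Counter(text)
    n = len(text)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


def bwt(text):
    """Naive Burrows-Wheeler Transform using sorted rotations.

    A NUL sentinel marks the end of the text.
    """
    s = text + "\0"
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def run_lengths(text):
    """Run-length encode text into [character, count] pairs."""
    runs = []
    for ch in text:
        if runs and runs[-1][0] == ch:
            runs[-1][1] += 1
        else:
            runs.append([ch, 1])
    return runs


def run_ratio(text):
    """Hypothetical score: number of runs after BWT divided by length.

    Lower values mean the text is more repetitive.
    """
    transformed = bwt(text)
    return len(run_lengths(transformed)) / len(transformed)


repetitive = "abcabcabcabc"
shuffled = "acbbcabacbac"  # same characters in a less regular order

same_entropy = math.isclose(shannon_entropy(repetitive), shannon_entropy(shuffled))
more_repetitive = run_ratio(repetitive) < run_ratio(shuffled)
```

Both strings contain four each of `a`, `b`, and `c`, so their Shannon entropy is identical, while the BWT groups the repeated pattern into long runs that run-length encoding compresses well.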