---
title: Exploratory Data Analysis on Vtubers Live Chat
---
A little experiment and analysis on the toxic people drifting around YouTube live chats.

# Why

The motivation is straightforward: I just feel sad when the talents suffer from random toxic chats. The goal is equally straightforward: design an automated system that spots toxic chats and quarantines them.

# Data, Data, Data
> I can't make bricks without clay.
> — Sherlock Holmes

For the experiment, I need a myriad of live chat messages and moderation events.

Unfortunately, the YouTube API does not offer a way to retrieve these kinds of events in real time, which is crucial because live streams are the only place where we can observe moderators' activities (deletions and bans). Once a stream is archived, these events are no longer available to fetch.

## Collecting Crusts

So I ended up developing a library that accumulates events from a live stream, along with a fancy CLI app that mimics live chat. It accepts a YouTube video ID and saves live chats in [JSON Lines](https://jsonlines.org/) format:

```bash
collector <videoId>
```
![](realtime-chat.gif)

A line with white text is a normal chat, red text is a ban event, and yellow text is a deletion event.

I know, that's not scalable at all. Every time a new live stream starts, I copy and paste its video ID into the terminal and run the script. How sophisticated.

## Make Bread Rise
Thankfully, there's a great web service around the Hololive community: [Holotools](https://hololive.jetri.co). They operate an API that gives us an index of past, ongoing, and upcoming live streams from Hololive talents.

Here I divided the system into two components: a scheduler and workers. The scheduler periodically checks for newly scheduled live streams through the Holotools API and creates a job to be handled by the workers. The workers pick up those jobs and spawn a process to collect live chat events.
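The division of labor can be sketched like so. This is a simplified, in-process sketch rather than the actual implementation: the Holotools call and the collector subprocess are stubbed out, and all names are my own.

```python
import queue

class Scheduler:
    """Polls a stream index and enqueues every unseen video ID as a job."""

    def __init__(self, jobs: queue.Queue):
        self.jobs = jobs
        self.seen = set()  # video IDs we have already scheduled

    def poll(self, live_streams):
        # In the real system, `live_streams` would come from the Holotools API.
        for video_id in live_streams:
            if video_id not in self.seen:
                self.seen.add(video_id)
                self.jobs.put(video_id)

def worker(jobs: queue.Queue, handled: list):
    """Drain the job queue; a real worker would spawn `collector <videoId>` instead."""
    while True:
        try:
            video_id = jobs.get_nowait()
        except queue.Empty:
            return
        handled.append(video_id)  # stand-in for launching the collector process
        jobs.task_done()
```

Deduplicating on the scheduler side means a stream that appears in several polls still gets exactly one collector.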

At this point, saving chats to text files in JSONL format became ineffective as the throughput grew tremendously, so I switched the data sink to MongoDB.

![](scalability.png)
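One way to keep up with that throughput is to buffer events and write them to MongoDB in bulk rather than one document at a time. A minimal batching sketch (the helper name and batch size are my own choices, not the actual implementation):

```python
def batched(events, size=500):
    """Group an event stream into fixed-size batches for bulk insertion."""
    batch = []
    for event in events:
        batch.append(event)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:  # flush the final partial batch
        yield batch

# With pymongo, each batch then costs a single round trip:
#   collection.insert_many(batch, ordered=False)
```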

I've been running the cluster for a while, and so far it hoards approximately one million comments per day. Now I can reliably run my own bakery.

# Look Before You Leap

Okay, now there are five million chats sitting in the MongoDB store. Let's take a close look at them before actually starting to build a model.

## Troll's Behavior
## By talent
## By language
# Creating Dataset
## Labelling Toxic Chat
### Utilizing Moderators' Activities
### Browser Extension
### Normalized Co-occurrence Entropy
Shannon entropy alone is not enough, so I combined the ideas of the [Burrows-Wheeler Transform](https://en.wikipedia.org/wiki/Burrows%E2%80%93Wheeler_transform) and [Run-length Encoding](https://en.wikipedia.org/wiki/Run-length_encoding) to formulate a new entropy that represents the "spamminess" of a given text.

$$
NCE(T) = \frac{N_T}{RLE_{string}(BWT(T))}
$$
$$
BWT[T,i] = \begin{cases} T[SA[i]-1], & \text{if }SA[i] > 0\\ \$, & \text{otherwise}\end{cases}
$$
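A direct reading of the two definitions, assuming $RLE_{string}$ denotes the length of the run-length encoded string: build the BWT (here via naive sorted rotations with a sentinel, which is fine for chat-length strings), count its runs, and divide the text length by that count. Repetitive, spammy text groups into long runs under the BWT, so it scores higher:

```python
from itertools import groupby

def bwt(text: str) -> str:
    """Burrows-Wheeler Transform via sorted rotations (naive but fine for short strings)."""
    t = text + "\x00"  # sentinel that sorts before every other character
    rotations = sorted(t[i:] + t[:i] for i in range(len(t)))
    return "".join(rotation[-1] for rotation in rotations)

def rle_length(s: str) -> int:
    """Number of character runs, i.e. the length of the run-length encoded form."""
    return sum(1 for _ in groupby(s))

def nce(text: str) -> float:
    """Normalized Co-occurrence Entropy: |T| / |RLE(BWT(T))|."""
    return len(text) / rle_length(bwt(text))
```

For example, `nce("wwwwwwww")` is 4.0 while `nce("abcd")` is only 0.8: the more repetitive the text, the longer the runs after the transform, and the higher the score.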
### Sentence Encoding

Here's a [t-SNE](https://en.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding) visualization of the output of the Sentence Transformer. Blue dots are spam and orange dots are normal chats.

![](tsne-sentence-encoding.png)
# Learn
## Gradient Boosting
## Neural Networks
# Future

When it's ready, I'm going to publish the dataset and pre-trained models used in this experiment.