# ❤️‍🩹 Sensai: Toxic Chat Dataset

Sensai is a toxic chat dataset consisting of live chats from Virtual YouTubers' live streams.

Download the dataset from [Huggingface Hub](https://huggingface.co/datasets/holodata/sensai) or alternatively from [Kaggle Datasets](https://www.kaggle.com/uetchy/sensai).

Join the `#livechat-dataset` channel on [holodata Discord](https://holodata.org/discord) for discussions.

## Provenance

- **Source:** YouTube Live Chat events (all streams covered by [Holodex](https://holodex.net), including Hololive, Nijisanji, 774inc, etc.)
- **Temporal Coverage:** From 2021-01-15T05:15:33Z
- **Update Frequency:** At least once per month

## Research Ideas

- Toxic Chat Classification
- Spam Detection
- Sentence Transformer for Live Chats

See [public notebooks](https://www.kaggle.com/uetchy/sensai/code) for ideas.

## Files

| filename                  | summary                                                        | size     |
| ------------------------- | -------------------------------------------------------------- | -------- |
| `chats_flagged_%Y-%m.csv` | Chats flagged as either deleted or banned by mods (3,100,000+) | ~ 400 MB |
| `chats_nonflag_%Y-%m.csv` | Non-flagged chats (3,100,000+)                                 | ~ 300 MB |

To keep the dataset balanced, the non-flagged chats (`chats_nonflag`) are randomly sampled so that their number matches the flagged chats (`chats_flagged`).
Ban and deletion are equivalent to `markChatItemsByAuthorAsDeletedAction` and `markChatItemAsDeletedAction`, respectively.
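
The balancing step described above can be sketched roughly as follows. This is a minimal illustration on toy data, not the project's actual preprocessing script: the `balance` helper, its `seed` parameter, and the example rows are all hypothetical.

```python
import pandas as pd


def balance(flagged: pd.DataFrame, nonflag: pd.DataFrame, seed: int = 0) -> pd.DataFrame:
    """Randomly sample `nonflag` to the size of `flagged`, then concatenate both.

    Hypothetical helper: assumes there are at least as many non-flagged
    chats as flagged ones, and falls back to the smaller count otherwise.
    """
    n = min(len(flagged), len(nonflag))
    sampled = nonflag.sample(n=n, random_state=seed)
    return pd.concat([flagged, sampled], ignore_index=True)


# Toy frames standing in for one month of chats (made-up values)
flagged = pd.DataFrame({"body": ["spam 1", "spam 2"], "label": ["deleted", "hidden"]})
nonflag = pd.DataFrame({"body": ["hi", "nice stream", "lol", "gg"], "label": ["nonflagged"] * 4})

balanced = balance(flagged, nonflag)
print(balanced["label"].value_counts())
```

Fixing `random_state` keeps the sample reproducible between runs; drop it if a fresh sample is wanted each time.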

## Dataset Breakdown

### Chats (`chats_%Y-%m.parquet`)

| column          | type   | description                   |
| --------------- | ------ | ----------------------------- |
| body            | string | chat message                  |
| authorChannelId | string | anonymized author channel id  |
| channelId       | string | source channel id             |
| label           | string | {deleted, hidden, nonflagged} |
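
For binary classification it can be convenient to collapse the three `label` values into a single numeric target. The sketch below assumes that both `deleted` and `hidden` should count as toxic. That mapping, the `toxic` column it creates, and the toy rows are illustrative assumptions rather than part of the dataset specification.

```python
import pandas as pd

# Assumed mapping: any moderator action counts as toxic (1), everything else as 0
LABEL_TO_TOXIC = {"deleted": 1, "hidden": 1, "nonflagged": 0}

# Toy rows using the column layout from the table above (made-up values)
df = pd.DataFrame(
    {
        "body": ["buy followers now", "go away", "great song!"],
        "authorChannelId": ["a1", "b2", "c3"],
        "channelId": ["ch1", "ch1", "ch2"],
        "label": ["deleted", "hidden", "nonflagged"],
    }
)

mapped = df["label"].map(LABEL_TO_TOXIC)
# Fail loudly on label values that the mapping does not cover
if mapped.isna().any():
    raise ValueError("unexpected label value in dataset")
df["toxic"] = mapped.astype(int)

print(df[["label", "toxic"]])
```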

## Usage

### Pandas

```python
import pandas as pd
from glob import glob

# Read every monthly parquet file and concatenate them into a single DataFrame
df = pd.concat([pd.read_parquet(x) for x in glob('../input/sensai/*.parquet')], ignore_index=True)
```

### Huggingface Transformers

See the [Huggingface Datasets loading guide](https://huggingface.co/docs/datasets/loading_datasets.html).

```python
# $ pip3 install datasets transformers
from datasets import load_dataset, Features, ClassLabel, Value
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

# Load the dataset, declaring `toxic` as a two-class label
dataset = load_dataset("holodata/sensai",
                       features=Features(
                           {
                               "body": Value("string"),
                               "toxic": ClassLabel(num_classes=2, names=['0', '1'])
                           }
                       ))

model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=2)
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

def tokenize_function(examples):
    return tokenizer(examples["body"], padding="max_length", truncation=True)

# Tokenize a random subset of 50,000 chats
tokenized_datasets = dataset['train'].shuffle().select(range(50000)).map(tokenize_function, batched=True)
# rename_column returns a new dataset (the in-place rename_column_ was removed from datasets)
tokenized_datasets = tokenized_datasets.rename_column("toxic", "label")
splitset = tokenized_datasets.train_test_split(test_size=0.2)
training_args = TrainingArguments("test_trainer")

trainer = Trainer(
    model=model, args=training_args, train_dataset=splitset['train'], eval_dataset=splitset['test']
)

trainer.train()
```

### Tangram

```bash
python3 ./examples/prepare_tangram_dataset.py
tangram train --file ./tangram_input.csv --target label
```

## Consideration

### Anonymization

`authorChannelId` values are anonymized using the SHA-1 hashing algorithm with a pinch of undisclosed salt.
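
As a rough illustration of salted SHA-1 hashing, the sketch below prepends a salt to a channel id before hashing. The real salt and the exact way it is combined with the id are undisclosed, so the `anonymize_channel_id` helper, the `SALT` value, and the prepend-then-hash scheme are all assumptions and will not reproduce the ids in the dataset.

```python
import hashlib


def anonymize_channel_id(channel_id: str, salt: str) -> str:
    """Return the hex SHA-1 digest of salt + channel_id (assumed scheme)."""
    return hashlib.sha1((salt + channel_id).encode("utf-8")).hexdigest()


# Placeholder salt and channel id; the real salt is not public
SALT = "example-salt"
hashed = anonymize_channel_id("UC_example_channel_id", SALT)
print(hashed)
```

Because the hash is deterministic, the same author maps to the same anonymized id across all files, which makes per-author analysis possible without exposing the original channel id.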

### Handling Custom Emojis

All custom emojis are replaced with the Unicode replacement character `U+FFFD`.
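
Depending on the task, the placeholder can be counted as a feature or stripped before tokenization. The following is a minimal sketch of both options; the helper names and the sample message are hypothetical.

```python
REPLACEMENT_CHAR = "\ufffd"  # U+FFFD, the stand-in for each custom emoji


def count_custom_emojis(body: str) -> int:
    """Count how many custom emojis a chat message originally contained."""
    return body.count(REPLACEMENT_CHAR)


def strip_custom_emojis(body: str) -> str:
    """Drop the placeholders and collapse the whitespace left behind."""
    return " ".join(body.replace(REPLACEMENT_CHAR, " ").split())


message = "hello \ufffd\ufffd world \ufffd"
print(count_custom_emojis(message))
print(strip_custom_emojis(message))
```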

## Citation

```latex
@misc{sensai-dataset,
    author={Yasuaki Uechi},
    title={Sensai: Toxic Chat Dataset},
    year={2021},
    month={8},
    version={31},
    url={https://github.com/holodata/sensai-dataset}
}
```

## License

- Code: [MIT License](https://github.com/holodata/sensai-dataset/blob/master/LICENSE)
- Dataset: [ODC Public Domain Dedication and Licence (PDDL)](https://opendatacommons.org/licenses/pddl/1-0/index.html)