# ❤️🩹 Sensai: Toxic Chat Dataset

Sensai is a toxic chat dataset consisting of live chat messages from Virtual YouTubers' live streams.

Download the dataset from [Kaggle Datasets](https://www.kaggle.com/uetchy/sensai) and join the `#livechat-dataset` channel on the [holodata Discord](https://holodata.org/discord) for discussions.

## Provenance
- **Source:** YouTube Live Chat events (all streams covered by [Holodex](https://holodex.net), including Hololive, Nijisanji, 774inc, etc.)
- **Temporal Coverage:** From 2021-01-15T05:15:33Z
- **Update Frequency:** At least once per month
## Research Ideas
- Toxic Chat Classification
- Spam Detection
- Sentence Transformer for Live Chats

See [public notebooks](https://www.kaggle.com/uetchy/sensai/code) for ideas.

## Files
| filename | summary | size |
| ------------------------- | -------------------------------------------------------------- | -------- |
| `chats_flagged_%Y-%m.csv` | Chats flagged as either deleted or banned by mods (3,100,000+) | ~ 400 MB |
| `chats_nonflag_%Y-%m.csv` | Non-flagged chats (3,100,000+) | ~ 300 MB |

To balance the dataset, the non-flagged chats (`chats_nonflag`) are randomly sampled so that their count matches that of `chats_flagged`.
Bans and deletions correspond to `markChatItemsByAuthorAsDeletedAction` and `markChatItemAsDeletedAction`, respectively.
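
The balancing step can be pictured as plain random downsampling. The following is a minimal sketch under stated assumptions, not the dataset's actual build pipeline: the helper name `downsample_to_match`, the fixed seed, and the toy messages are all hypothetical.

```python
import random


def downsample_to_match(flagged, nonflagged, seed=0):
    """Randomly sample `nonflagged` down to the size of `flagged`.

    Hypothetical helper for illustration only; the real sampling
    procedure used to build the dataset is not documented here.
    """
    if len(nonflagged) < len(flagged):
        raise ValueError("not enough non-flagged chats to match the flagged count")
    rng = random.Random(seed)  # a fixed seed keeps the sample reproducible
    return rng.sample(nonflagged, k=len(flagged))


# Toy stand-ins for the two groups of chat messages.
flagged = ["spam spam", "rude message"]
nonflagged = ["hello", "nice stream", "lol", "good morning"]
balanced_nonflagged = downsample_to_match(flagged, nonflagged)
```

After this step both groups have the same number of rows, which is the property the published files are described as having.
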
## Dataset Breakdown
### Chats (`chats_flagged_%Y-%m.csv`, `chats_nonflag_%Y-%m.csv`)
| column | type | description |
| --------------- | ------ | ---------------------------- |
| body | string | chat message |
| membership | string | membership status |
| authorChannelId | string | anonymized author channel id |
| channelId | string | source channel id |
#### Membership status
| value | duration |
| ----------------- | ------------------------- |
| unknown | Indistinguishable |
| non-member | 0 |
| less than 1 month | < 1 month |
| 1 month | >= 1 month, < 2 months |
| 2 months | >= 2 months, < 6 months |
| 6 months | >= 6 months, < 12 months |
| 1 year | >= 12 months, < 24 months |
| 2 years | >= 24 months |
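
Because the membership values form an ordered scale, it can be convenient to turn them into integers before modelling. The mapping below is a minimal sketch: it assumes the `membership` column contains exactly the strings listed in the table, and the names `MEMBERSHIP_RANK` and `membership_rank` are hypothetical.

```python
# Ordinal rank for each membership value, from shortest to longest duration.
# "unknown" is deliberately left out because it has no position on the scale.
MEMBERSHIP_RANK = {
    "non-member": 0,
    "less than 1 month": 1,
    "1 month": 2,
    "2 months": 3,
    "6 months": 4,
    "1 year": 5,
    "2 years": 6,
}


def membership_rank(value):
    """Return the ordinal rank, or None for "unknown" or unrecognized values."""
    return MEMBERSHIP_RANK.get(value)
```

With a pandas DataFrame, something like `df['membership'].map(MEMBERSHIP_RANK)` would apply the same mapping column-wide, leaving unmapped values as missing.
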
#### Pandas usage
Set `keep_default_na` to `False` and `na_values` to `''` in `read_csv`. Otherwise, a chat message such as `NA` would incorrectly be treated as a NaN value.
```python
import pandas as pd
from glob import iglob

# Read every monthly file of flagged chats into a single DataFrame.
# keep_default_na=False stops pandas from parsing strings such as "NA" as NaN;
# na_values='' still treats empty fields as missing.
flagged = pd.concat([
    pd.read_csv(f,
                na_values='',
                keep_default_na=False)
    for f in iglob('../input/sensai/chats_flagged_*.csv')
], ignore_index=True)
```
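
For classification experiments, the flagged and non-flagged files usually need to end up in one labeled frame. The sketch below extends the loading pattern above under a few assumptions: the `flagged` label column and the helper names `load_chats` and `load_labeled` are hypothetical and not part of the dataset.

```python
from glob import glob

import pandas as pd


def load_chats(pattern, label):
    """Read every CSV matching `pattern` and tag each row with `label`."""
    frames = [
        pd.read_csv(f, na_values='', keep_default_na=False)
        for f in sorted(glob(pattern))
    ]
    if not frames:
        raise FileNotFoundError(f"no files match {pattern!r}")
    chats = pd.concat(frames, ignore_index=True)
    chats['flagged'] = label  # hypothetical label column added for training
    return chats


def load_labeled(data_dir):
    """Combine flagged (1) and non-flagged (0) chats into one DataFrame."""
    return pd.concat([
        load_chats(f'{data_dir}/chats_flagged_*.csv', 1),
        load_chats(f'{data_dir}/chats_nonflag_*.csv', 0),
    ], ignore_index=True)
```

On Kaggle, a call such as `load_labeled('../input/sensai')` would then return all rows with a 0/1 label, ready to be shuffled and split.
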
## Consideration
### Anonymization
`authorChannelId` values are anonymized using the SHA-1 hashing algorithm with an undisclosed salt.
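
To illustrate what salted SHA-1 hashing looks like in general, here is a minimal sketch. The salt value, the order in which salt and ID are combined, and the function name `anonymize_channel_id` are all assumptions; since the real salt is undisclosed, this cannot reproduce the hashes found in the dataset.

```python
import hashlib

# Placeholder only: the dataset's real salt is undisclosed.
EXAMPLE_SALT = b"example-salt"


def anonymize_channel_id(channel_id, salt=EXAMPLE_SALT):
    """Return the hex SHA-1 digest of salt + channel id (assumed scheme)."""
    return hashlib.sha1(salt + channel_id.encode("utf-8")).hexdigest()


# A made-up channel id used purely as sample input.
hashed = anonymize_channel_id("UCxxxxxxxxxxxxxxxxxxxxxx")
```

Because the hash is deterministic, the same author always maps to the same anonymized ID, so per-author grouping remains possible without exposing the original channel.
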
### Handling Custom Emojis
All custom emojis are replaced with the Unicode replacement character `U+FFFD`.
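
Depending on the task, these placeholders can serve as a feature or be removed before tokenization. The helpers below are a minimal sketch; their names are hypothetical, and the sample message is made up.

```python
REPLACEMENT_CHAR = "\ufffd"  # U+FFFD, the placeholder used for custom emojis


def count_emoji_placeholders(body):
    """Count the U+FFFD placeholder characters in a chat message."""
    return body.count(REPLACEMENT_CHAR)


def strip_emoji_placeholders(body):
    """Remove the placeholders and collapse the leftover whitespace."""
    return " ".join(body.replace(REPLACEMENT_CHAR, " ").split())


sample = "nice stream \ufffd\ufffd see you"
placeholder_count = count_emoji_placeholders(sample)
cleaned = strip_emoji_placeholders(sample)
```

Keeping the count as a separate feature preserves some signal about emoji-heavy messages even when the characters themselves are stripped.
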
## Citation
```latex
@misc{sensai-dataset,
    author={Yasuaki Uechi},
    title={Sensai: Toxic Chat Dataset},
    year={2021},
    month={8},
    version={31},
    url={https://github.com/holodata/sensai-dataset}
}
```
## License
- Code: [MIT License](https://github.com/holodata/sensai-dataset/blob/master/LICENSE)
- Dataset: [ODC Public Domain Dedication and Licence (PDDL)](https://opendatacommons.org/licenses/pddl/1-0/index.html)