# ❤️‍🩹 Sensai: Toxic Chat Dataset

Sensai is a toxic chat dataset consisting of live chats from Virtual YouTubers' live streams.

Download the dataset from [Huggingface Hub](https://huggingface.co/datasets/holodata/sensai) or alternatively from [Kaggle Datasets](https://www.kaggle.com/uetchy/sensai).
Join `#livechat-dataset` channel on [holodata Discord](https://holodata.org/discord) for discussions.
## Provenance
- **Source:** YouTube Live Chat events (all streams covered by [Holodex](https://holodex.net), including Hololive, Nijisanji, 774inc, etc.)
- **Temporal Coverage:** From 2021-01-15T05:15:33Z
- **Update Frequency:** At least once per month
## Research Ideas
- Toxic Chat Classification
- Spam Detection
- Sentence Transformer for Live Chats
See [public notebooks](https://www.kaggle.com/uetchy/sensai/code) for ideas.
## Files
| filename | summary | size |
| ------------------------- | -------------------------------------------------------------- | -------- |
| `chats_flagged_%Y-%m.csv` | Chats flagged as either deleted or banned by mods (3,100,000+) | ~ 400 MB |
| `chats_nonflag_%Y-%m.csv` | Non-flagged chats (3,100,000+) | ~ 300 MB |
To keep the dataset balanced, `chats_nonflag` is randomly downsampled so that it contains the same number of chats as `chats_flagged`.
Bans and deletions correspond to the `markChatItemsByAuthorAsDeletedAction` and `markChatItemAsDeletedAction` chat events, respectively.
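
The balancing step can be sketched with pandas on a toy frame (the rows and values below are hypothetical; only the `label` vocabulary comes from the dataset schema):

```python
import pandas as pd

# Hypothetical toy rows standing in for the raw chat logs; the label
# vocabulary {deleted, hidden, nonflagged} matches the dataset schema.
chats = pd.DataFrame({
    "body": ["hi", "buy now!!", "nice", "??", "gg", "wb"],
    "label": ["nonflagged", "deleted", "nonflagged", "hidden", "nonflagged", "nonflagged"],
})

flagged = chats[chats["label"] != "nonflagged"]
nonflagged = chats[chats["label"] == "nonflagged"]

# Randomly downsample the non-flagged side to the flagged count,
# mirroring how the published dataset was balanced.
balanced = pd.concat([flagged, nonflagged.sample(n=len(flagged), random_state=42)])
print(len(balanced))  # 4
```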
## Dataset Breakdown
### Chats (`chats_%Y-%m.parquet`)
| column | type | description |
| --------------- | ------ | ----------------------------- |
| body | string | chat message |
| authorChannelId | string | anonymized author channel id |
| channelId | string | source channel id |
| label | string | {deleted, hidden, nonflagged} |
## Usage
### Pandas
```python
import pandas as pd
from glob import glob
df = pd.concat([pd.read_parquet(x) for x in glob('../input/sensai/*.parquet')], ignore_index=True)
```
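
The Transformers example below expects a binary `toxic` target. One way to derive it from the three-way `label` column (toy frame for illustration):

```python
import pandas as pd

# Toy frame standing in for the concatenated parquet files.
df = pd.DataFrame({
    "body": ["hello", "buy now!!", "gg wp"],
    "label": ["nonflagged", "deleted", "hidden"],
})

# Anything flagged (deleted or hidden) counts as toxic.
df["toxic"] = (df["label"] != "nonflagged").astype(int)
print(df["toxic"].tolist())  # [0, 1, 1]
```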
### Huggingface Transformers
See the [🤗 Datasets loading documentation](https://huggingface.co/docs/datasets/loading_datasets.html) for details.

```python
# $ pip3 install datasets transformers
from datasets import load_dataset, Features, ClassLabel, Value
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

dataset = load_dataset(
    "holodata/sensai",
    features=Features(
        {
            "body": Value("string"),
            "toxic": ClassLabel(num_classes=2, names=["0", "1"]),
        }
    ),
)

model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=2)
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

def tokenize_function(examples):
    return tokenizer(examples["body"], padding="max_length", truncation=True)

tokenized_datasets = dataset["train"].shuffle().select(range(50000)).map(tokenize_function, batched=True)
tokenized_datasets = tokenized_datasets.rename_column("toxic", "label")
splitset = tokenized_datasets.train_test_split(0.2)

training_args = TrainingArguments("test_trainer")
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=splitset["train"],
    eval_dataset=splitset["test"],
)
trainer.train()
```
### Tangram
```bash
python3 ./examples/prepare_tangram_dataset.py
tangram train --file ./tangram_input.csv --target label
```
## Consideration
### Anonymization
`authorChannelId` is anonymized using the SHA-1 hashing algorithm with a pinch of undisclosed salt.
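
The scheme can be sketched with `hashlib`; the salt below is purely illustrative, since the real one is undisclosed:

```python
import hashlib

SALT = b"example-salt"  # illustrative only; the real salt is undisclosed

def anonymize(channel_id: str) -> str:
    """Salted SHA-1 digest of a raw channel id, as hex."""
    return hashlib.sha1(SALT + channel_id.encode("utf-8")).hexdigest()

# The hash is deterministic, so per-author grouping still works
# without exposing the original channel id.
assert anonymize("UCabc123") == anonymize("UCabc123")
assert anonymize("UCabc123") != anonymize("UCxyz789")
```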
### Handling Custom Emojis
All custom emojis are replaced with the Unicode replacement character `U+FFFD`.
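
If the placeholders add noise for a given model, they can be stripped during preprocessing (a judgment call on the user's side, not part of the dataset itself):

```python
def strip_emoji_placeholders(text: str) -> str:
    """Drop U+FFFD placeholders left by custom-emoji replacement."""
    return text.replace("\ufffd", "").strip()

print(strip_emoji_placeholders("\ufffd nice stream \ufffd"))  # nice stream
```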
## Citation
```bibtex
@misc{sensai-dataset,
  author={Yasuaki Uechi},
  title={Sensai: Toxic Chat Dataset},
  year={2021},
  month={8},
  version={31},
  url={https://github.com/holodata/sensai-dataset}
}
```
## License
- Code: [MIT License](https://github.com/holodata/sensai-dataset/blob/master/LICENSE)
- Dataset: [ODC Public Domain Dedication and Licence (PDDL)](https://opendatacommons.org/licenses/pddl/1-0/index.html)