chore: update infos

uetchy 2021-09-02 23:44:18 +09:00
parent aed3a390b8
commit 7c643c67cd
5 changed files with 456 additions and 19 deletions

18
.gitattributes vendored

@ -25,19 +25,5 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
*.zip filter=lfs diff=lfs merge=lfs -text
*.zstandard filter=lfs diff=lfs merge=lfs -text
*tfevents* filter=lfs diff=lfs merge=lfs -text
chats_flagged_2021-06.csv filter=lfs diff=lfs merge=lfs -text
chats_nonflag_2021-02.csv filter=lfs diff=lfs merge=lfs -text
chats_nonflag_2021-06.csv filter=lfs diff=lfs merge=lfs -text
chats_flagged_2021-03.csv filter=lfs diff=lfs merge=lfs -text
chats_nonflag_2021-04.csv filter=lfs diff=lfs merge=lfs -text
chats_nonflag_2021-07.csv filter=lfs diff=lfs merge=lfs -text
chats_flagged_2021-07.csv filter=lfs diff=lfs merge=lfs -text
chats_flagged_2021-02.csv filter=lfs diff=lfs merge=lfs -text
chats_flagged_2021-04.csv filter=lfs diff=lfs merge=lfs -text
chats_nonflag_2021-03.csv filter=lfs diff=lfs merge=lfs -text
chats_nonflag_2021-08.csv filter=lfs diff=lfs merge=lfs -text
chats_flagged_2021-01.csv filter=lfs diff=lfs merge=lfs -text
chats_flagged_2021-08.csv filter=lfs diff=lfs merge=lfs -text
chats_nonflag_2021-01.csv filter=lfs diff=lfs merge=lfs -text
chats_nonflag_2021-05.csv filter=lfs diff=lfs merge=lfs -text
chats_flagged_2021-05.csv filter=lfs diff=lfs merge=lfs -text
*.csv filter=lfs diff=lfs merge=lfs -text

145
.gitignore vendored Normal file

@ -0,0 +1,145 @@
# Created by https://www.toptal.com/developers/gitignore/api/python
# Edit at https://www.toptal.com/developers/gitignore?templates=python
### Python ###
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class
# C extensions
*.so
# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST
# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec
# Installer logs
pip-log.txt
pip-delete-this-directory.txt
# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/
cover/
# Translations
*.mo
*.pot
# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal
# Flask stuff:
instance/
.webassets-cache
# Scrapy stuff:
.scrapy
# Sphinx documentation
docs/_build/
# PyBuilder
.pybuilder/
target/
# Jupyter Notebook
.ipynb_checkpoints
# IPython
profile_default/
ipython_config.py
# pyenv
# For a library or package, you might want to ignore these files since the code is
# intended to run in multiple environments; otherwise, check them in:
# .python-version
# pipenv
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
# However, in case of collaboration, if having platform-specific dependencies or dependencies
# having no cross-platform support, pipenv may install dependencies that don't work, or not
# install all needed dependencies.
#Pipfile.lock
# PEP 582; used by e.g. github.com/David-OConnor/pyflow
__pypackages__/
# Celery stuff
celerybeat-schedule
celerybeat.pid
# SageMath parsed files
*.sage.py
# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/
# Spyder project settings
.spyderproject
.spyproject
# Rope project settings
.ropeproject
# mkdocs documentation
/site
# mypy
.mypy_cache/
.dmypy.json
dmypy.json
# Pyre type checker
.pyre/
# pytype static type analyzer
.pytype/
# Cython debug symbols
cython_debug/
# End of https://www.toptal.com/developers/gitignore/api/python

288
README.md Normal file

@ -0,0 +1,288 @@
![Header](https://github.com/holodata/vtuber-livechat-dataset/blob/master/.github/kaggle-dataset-header.png?raw=true)
# VTuber 500M: Live Chat and Moderation Events
VTuber 500M is a collection of hundreds of millions of live chat, super chat, and moderation events (bans and deletions) from Virtual YouTubers' live streams, ready for academic research and NLP projects.
Download the dataset from [Kaggle Datasets](https://www.kaggle.com/uetchy/vtuber-livechat) and join the `#livechat-dataset` channel on the [holodata Discord](https://holodata.org/discord) for discussion.
## Provenance
- **Source:** YouTube live chat events collected by our [Honeybee](https://github.com/holodata/honeybee) cluster. [Holodex](https://holodex.net) is the stream index provider for Honeybee and covers Hololive, Nijisanji, 774inc, etc.
- **Temporal Coverage:**
- Chats: from 2021-01-15T05:15:33Z
- Superchats: from 2021-03-16T08:19:38Z
- **Update Frequency:**
- At least once per month
## Research Ideas
- Toxic Chat Classification
- Spam Detection
- Demographic Visualization
- Superchat Analysis
- Sentence Transformer for Live Chats
See [public notebooks](https://www.kaggle.com/uetchy/vtuber-livechat/code) for ideas.
> We employed a [Honeybee](https://github.com/holodata/honeybee) cluster to collect real-time live chat events across major VTubers' live streams. All sensitive data, such as author names and author profile images, are omitted from the dataset, and author channel ids are anonymized with the SHA-1 hashing algorithm and a salt.
## Versions
### Standard version
The standard version is available on [Kaggle Datasets](https://www.kaggle.com/uetchy/vtuber-livechat).
| filename | summary | size |
| ---------------------- | -------------------------------- | -------- |
| `channels.csv` | Channel index | < 1 MB |
| `chat_stats.csv` | Chat statistics | < 1 MB |
| `superchat_stats.csv` | Super Chat statistics | < 1 MB |
| `chats_%Y-%m.csv` | Live chat events (~ 500,000,000) | ~ 50 GB |
| `superchats_%Y-%m.csv` | Super chat events (~ 2,000,000) | ~ 200 MB |
| `deletion_events.csv` | Deletion events | ~ 150 MB |
| `ban_events.csv` | Ban events | ~ 25 MB |
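The `%Y-%m` part of the file names above reads as a `strftime`-style year and month, so the monthly files can be enumerated programmatically. The following is a minimal sketch under that assumption; the helper name `monthly_filenames` and the chosen date range are illustrative and not part of the dataset tooling.

```python
from datetime import date


def monthly_filenames(prefix, start, end):
    """Yield names such as 'chats_2021-01.csv' for each month from start to end (inclusive)."""
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        # date objects accept strftime directives as a format spec
        yield f"{prefix}_{date(year, month, 1):%Y-%m}.csv"
        month += 1
        if month > 12:
            year, month = year + 1, 1


# Hypothetical range: January to March 2021
names = list(monthly_filenames("chats", date(2021, 1, 1), date(2021, 3, 1)))
print(names)
```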
### Full version
The full version is only available to those approved by the admins. If you are interested in conducting research or analysis using the dataset, please reach us in the `#livechat-dataset` channel on the [holodata Discord server](https://holodata.org/discord) or at `uechiy@acm.org` (for organizations).
| filename | summary | size |
| ---------------------- | ---------------------------------- | -------- |
| `channels.csv` | Channel index | < 1 MB |
| `chat_stats.csv` | Chat statistics | < 1 MB |
| `superchat_stats.csv` | Super Chat statistics | < 1 MB |
| `chats_%Y-%m.csv` | Live chat messages (~ 500,000,000) | ~ 90 GB |
| `superchats_%Y-%m.csv` | Super chat messages (~ 2,000,000) | ~ 400 MB |
| `deletion_events.csv` | Deletion events | ~ 150 MB |
| `ban_events.csv` | Ban events | ~ 25 MB |
### [❤️‍🩹 Sensai](https://github.com/holodata/sensai-dataset)
Sensai is a toxic chat dataset consisting of live chats from Virtual YouTubers' live streams.
| filename | summary | size |
| ------------------------- | -------------------------------------------------------------- | -------- |
| `chats_flagged_%Y-%m.csv` | Chats flagged as either deleted or banned by mods (3,100,000+) | ~ 400 MB |
| `chats_nonflag_%Y-%m.csv` | Non-flagged chats (3,000,000+) | ~ 300 MB |
## Dataset Breakdown
> Ban and deletion are equivalent to `markChatItemsByAuthorAsDeletedAction` and `markChatItemAsDeletedAction` respectively.
### Channels (`channels.csv`)
| column | type | description |
| ----------------- | --------------- | ---------------------- |
| channelId | string | channel id |
| name | string | channel name |
| englishName | nullable string | channel name (English) |
| affiliation | string | channel affiliation |
| group | nullable string | group |
| subscriptionCount | number | subscription count |
| videoCount | number | uploads count |
| photo | string | channel icon |
Inactive channels have `INACTIVE` in the `group` column.
### Chat Statistics (`chat_stats.csv`)
| column | type | description |
| -------------- | ------ | -------------------------------------------------- |
| channelId | string | channel id |
| period | string | period of interest (%Y-%m) |
| chats | number | number of chats |
| memberChats | number | number of chats with membership status attached |
| uniqueChatters | number | number of unique chatters |
| uniqueMembers | number | number of unique members who appeared in live chat |
| bannedChatters | number | number of unique chatters marked as banned by mods |
| deletedChats | number | number of chats deleted by mods |
### Super Chat Statistics (`superchat_stats.csv`)
| column | type | description |
| -------------------- | ------ | ---------------------------------- |
| channelId | string | channel id |
| period | string | period of interest (%Y-%m) |
| superChats | number | number of super chats |
| uniqueSuperChatters | number | number of unique super chatters |
| totalSC | number | total amount of super chats (JPY) |
| averageSC | number | average amount of super chat (JPY) |
| totalMessageLength | number | total message length |
| averageMessageLength | number | average message length |
| mostFrequentCurrency | string | most frequent currency |
| mostFrequentColor | string | most frequent color |
### Chats (`chats_%Y-%m.csv`)
| column | type | description | in standard version |
| --------------- | ---------------- | ---------------------------- | ------------------------ |
| timestamp | string | ISO 8601 UTC timestamp | seconds are omitted |
| id | string | anonymized chat id | N/A |
| authorChannelId | string | anonymized author channel id | |
| channelId | string | source channel id | |
| videoId | string | source video id | |
| body | string | chat message | N/A |
| membership | string | membership status | N/A |
| isMember | nullable boolean | is member (null if unknown) | only in standard version |
| isModerator | boolean | is channel moderator | N/A |
| isVerified | boolean | is verified account | N/A |
#### Membership status
| value | duration |
| ----------------- | ------------------------- |
| unknown | Indistinguishable |
| non-member | 0 |
| less than 1 month | < 1 month |
| 1 month | >= 1 month, < 2 months |
| 2 months | >= 2 months, < 6 months |
| 6 months | >= 6 months, < 12 months |
| 1 year | >= 12 months, < 24 months |
| 2 years | >= 24 months |
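Since the membership values are ordered labels rather than numbers, it can help to map them to an ordinal rank before sorting or grouping. The sketch below assumes the `membership` column contains exactly the strings listed in the table; the names `MEMBERSHIP_ORDER`, `membership_rank`, and `is_member` are hypothetical helpers, and the dataset's own `isMember` column is not necessarily derived this way.

```python
# Labels from the table above, from shortest to longest membership.
MEMBERSHIP_ORDER = [
    "non-member",
    "less than 1 month",
    "1 month",
    "2 months",
    "6 months",
    "1 year",
    "2 years",
]


def membership_rank(status):
    """Return an ordinal rank (0 = non-member), or None for 'unknown'."""
    if status == "unknown":
        return None
    return MEMBERSHIP_ORDER.index(status)


def is_member(status):
    """Return True/False for known statuses and None when the status is unknown."""
    rank = membership_rank(status)
    return None if rank is None else rank > 0


print(membership_rank("1 year"), is_member("non-member"), is_member("unknown"))
```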
#### Pandas usage
Set `keep_default_na` to `False` and `na_values` to `''` in `read_csv`. Otherwise, a chat message such as `NA` would incorrectly be treated as a NaN value.
```python
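import pandas as pd

# Only the empty string is treated as missing, so a literal "NA" message survives.
chats = pd.read_csv('../input/vtuber-livechat/chats_2021-03.csv',
                    na_values='',
                    keep_default_na=False,
                    index_col='timestamp',
                    parse_dates=True)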
```
### Superchats (`superchats_%Y-%m.csv`)
| column | type | description | in standard version |
| --------------- | --------------- | ---------------------------- | ------------------- |
| timestamp | string | ISO 8601 UTC timestamp | seconds are omitted |
| amount | number | purchased amount | |
| currency | string | three-letter currency symbol | |
| color | string | color | N/A |
| significance | number | significance | |
| body | nullable string | chat message | N/A |
| id | string | anonymized chat id | N/A |
| authorChannelId | string | anonymized author channel id | |
| videoId | string | source video id | N/A |
| channelId | string | source channel id | |
#### Color and Significance
| color | significance | purchase amount (¥) | purchase amount ($) | max. message length |
| --------- | ------------ | ------------------- | ------------------- | ------------------- |
| blue | 1 | ¥ 100 - 199 | $ 1.00 - 1.99 | 0 |
| lightblue | 2 | ¥ 200 - 499 | $ 2.00 - 4.99 | 50 |
| green | 3 | ¥ 500 - 999 | $ 5.00 - 9.99 | 150 |
| yellow | 4 | ¥ 1000 - 1999 | $ 10.00 - 19.99 | 200 |
| orange | 5 | ¥ 2000 - 4999 | $ 20.00 - 49.99 | 225 |
| magenta | 6 | ¥ 5000 - 9999 | $ 50.00 - 99.99 | 250 |
| red | 7 | ¥ 10000 - 50000 | $ 100.00 - 500.00 | 270 - 350 |
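The table can be read as a set of lower thresholds: each tier starts at the amount shown in its row. As a rough sketch of that lookup for yen amounts only, the function below uses the JPY column of the table; `tier_for_jpy` is a hypothetical helper, and other currencies would need their own thresholds.

```python
from bisect import bisect_right

# Lower bounds (in JPY) of each tier, taken from the table above.
_JPY_THRESHOLDS = [100, 200, 500, 1000, 2000, 5000, 10000]
_COLORS = ["blue", "lightblue", "green", "yellow", "orange", "magenta", "red"]


def tier_for_jpy(amount):
    """Return (color, significance) for a Super Chat amount in JPY."""
    if not 100 <= amount <= 50000:
        raise ValueError("amount is outside the range covered by the table")
    # The number of thresholds at or below the amount is the significance.
    significance = bisect_right(_JPY_THRESHOLDS, amount)
    return _COLORS[significance - 1], significance


print(tier_for_jpy(100), tier_for_jpy(1500), tier_for_jpy(50000))
```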
#### Pandas usage
Set `keep_default_na` to `False` and `na_values` to `''` in `read_csv`. Otherwise, a chat message such as `NA` would incorrectly be treated as a NaN value.
```python
import pandas as pd
from glob import iglob

# Load every monthly Super Chat file and stack them into one DataFrame.
sc = pd.concat([
    pd.read_csv(f,
                na_values='',
                keep_default_na=False,
                index_col='timestamp',
                parse_dates=True)
    for f in iglob('../input/vtuber-livechat/superchats_*.csv')
], ignore_index=False)
sc.sort_index(inplace=True)
```
### Deletion Events (`deletion_events.csv`)
| column | type | description |
| --------- | ------- | ---------------------------- |
| timestamp | string | UTC timestamp |
| id | string | anonymized chat id |
| retracted | boolean | deleted by the author themselves |
| videoId | string | source video id |
| channelId | string | source channel id |
#### Pandas usage
Add a `deleted_by_mod` column to the `chats` DataFrame:
```python
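import pandas as pd

chats = pd.read_csv('../input/vtuber-livechat/chats_2021-03.csv',
                    na_values='',
                    keep_default_na=False)
delet = pd.read_csv('../input/vtuber-livechat/deletion_events.csv',
                    usecols=['id', 'retracted'])
# Keep only deletions made by moderators, not retractions by the author.
# Redundant events for the same chat are collapsed so the merge adds no rows.
delet = delet[delet['retracted'] == 0].drop_duplicates(subset='id').copy()
delet['deleted_by_mod'] = True
chats = pd.merge(chats, delet[['id', 'deleted_by_mod']], on='id', how='left')
# Chats without a matching deletion event come back as NaN after the left merge.
chats['deleted_by_mod'] = chats['deleted_by_mod'].notna()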
```
### Ban Events (`ban_events.csv`)
Here, **ban** means either placing a user in time out or permanently hiding the user's comments on the channel's current and future live streams. The two are grouped together because they cannot be told apart from the data extracted from the `markChatItemsByAuthorAsDeletedAction` event.
| column | type | description |
| --------------- | ------ | --------------------- |
| timestamp | string | UTC timestamp |
| authorChannelId | string | anonymized channel id |
| videoId | string | source video id |
| channelId | string | source channel id |
#### Pandas usage
Add a `banned` column to the `chats` DataFrame:
```python
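import pandas as pd

chats = pd.read_csv('../input/vtuber-livechat/chats_2021-03.csv',
                    na_values='',
                    keep_default_na=False)
# Redundant bans for the same author and video are collapsed so the merge adds no rows.
ban = pd.read_csv('../input/vtuber-livechat/ban_events.csv',
                  usecols=['authorChannelId', 'videoId']).drop_duplicates()
ban['banned'] = True
chats = pd.merge(chats, ban, on=['authorChannelId', 'videoId'], how='left')
# Chats without a matching ban event come back as NaN after the left merge.
chats['banned'] = chats['banned'].notna()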
```
## Considerations
### Anonymization
`id` and `authorChannelId` are anonymized with the SHA-1 hashing algorithm and an undisclosed salt.
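To illustrate what salted hashing means in general, here is a minimal sketch using Python's `hashlib`. The salt value, the way it is combined with the id (simple prefixing), and the function name `anonymize` are all assumptions made for illustration; the dataset's actual salt and scheme are undisclosed, so this will not reproduce the published ids.

```python
import hashlib


def anonymize(raw_id, salt):
    """Return the hex SHA-1 digest of salt + raw_id (illustrative scheme only)."""
    return hashlib.sha1((salt + raw_id).encode("utf-8")).hexdigest()


# Hypothetical inputs: neither the salt nor the channel id is real.
hashed = anonymize("UC_example_channel_id", salt="not-the-real-salt")
print(hashed)
```

The same input always maps to the same 40-character digest, which is why an anonymized `authorChannelId` can still be used to group chats by author.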
### Handling Custom Emojis
All custom emojis are replaced with a Unicode replacement character `U+FFFD`.
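Because custom emojis appear as `U+FFFD`, they can be counted or stripped with plain string operations before text processing. This is a small sketch that assumes each custom emoji maps to exactly one replacement character; the helper names are hypothetical.

```python
REPLACEMENT_CHAR = "\ufffd"  # Unicode replacement character, U+FFFD


def count_custom_emojis(body):
    """Count the placeholders left by custom emojis in a chat message."""
    return body.count(REPLACEMENT_CHAR)


def strip_custom_emojis(body):
    """Remove the placeholders and trim any surrounding whitespace."""
    return body.replace(REPLACEMENT_CHAR, "").strip()


# Hypothetical message with two custom emojis at the end
message = "kusa \ufffd\ufffd"
print(count_custom_emojis(message), repr(strip_custom_emojis(message)))
```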
### Redundant Ban and Deletion Events
Bans and deletions issued by multiple moderators for the same person or chat are logged separately. For simplicity, you can safely ignore all but the first event in time order.
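Keeping only the first event per person or chat can be sketched with the standard library as follows. The sample rows are made up, the helper `first_events` is hypothetical, and the code assumes timestamps are ISO 8601 UTC strings in one consistent format, so that sorting them as strings matches time order.

```python
def first_events(events, key_fields):
    """Keep only the earliest event for each key, in time order."""
    seen = set()
    kept = []
    # Assumption: same-format ISO 8601 UTC timestamps sort correctly as strings.
    for event in sorted(events, key=lambda e: e["timestamp"]):
        key = tuple(event[field] for field in key_fields)
        if key not in seen:
            seen.add(key)
            kept.append(event)
    return kept


# Made-up ban events: the first two rows are the same ban logged twice.
bans = [
    {"timestamp": "2021-03-01T10:00:05Z", "authorChannelId": "a1", "videoId": "v1"},
    {"timestamp": "2021-03-01T10:00:02Z", "authorChannelId": "a1", "videoId": "v1"},
    {"timestamp": "2021-03-01T10:01:00Z", "authorChannelId": "b2", "videoId": "v1"},
]
unique_bans = first_events(bans, ["authorChannelId", "videoId"])
print(unique_bans)
```

With pandas, the equivalent is `sort_values('timestamp')` followed by `drop_duplicates(subset=[...], keep='first')`; for deletion events the key would be `id`.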
## Citation
```latex
@misc{vtuber-livechat-dataset,
author={Yasuaki Uechi},
title={VTuber 500M: Large Scale Virtual YouTubers Live Chat Dataset},
year={2021},
month={3},
version={31},
url={https://github.com/holodata/vtuber-livechat-dataset}
}
```
## License
- Code: [MIT License](https://github.com/holodata/vtuber-livechat-dataset/blob/master/LICENSE)
- Dataset: [ODC Public Domain Dedication and Licence (PDDL)](https://opendatacommons.org/licenses/pddl/1-0/index.html)


@ -1,3 +0,0 @@
{
"id": "uetchy/sensai"
}

21
dataset_infos.json Normal file

@ -0,0 +1,21 @@
{
"default": {
"description": "VTuber 500M: Live Chat and Moderation Events",
"homepage": "https://github.com/holodata/sensai-dataset",
"license": "PDDL",
"splits": {
"train": {
"name": "train",
"num_bytes": 385090,
"num_examples": 5452,
"dataset_name": "trec"
},
"test": {
"name": "test",
"num_bytes": 27983,
"num_examples": 500,
"dataset_name": "trec"
}
}
}
}