mirror of https://github.com/holodata/sensai-dataset.git (synced 2025-03-15 20:10:32 +09:00)
chore: fix docs
parent e241a46bfc
commit fe6fed0759
LICENSE (normal file, 19 lines changed)
@ -0,0 +1,19 @@
+Copyright (c) 2021 Yasuaki Uechi <y@uechi.io>
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
+IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM,
+DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR
+OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE
+OR OTHER DEALINGS IN THE SOFTWARE.
README.md (35 lines changed)
@ -34,12 +34,9 @@ Ban and deletion are equivalent to `markChatItemsByAuthorAsDeletedAction` and `m
 
 | column | type | description |
 | --------------- | ------ | ---------------------------- |
-| timestamp | string | UTC timestamp |
 | body | string | chat message |
 | membership | string | membership status |
-| id | string | anonymized chat id |
 | authorChannelId | string | anonymized author channel id |
-| videoId | string | source video id |
 | channelId | string | source channel id |
 
 #### Membership status
@ -60,33 +57,23 @@ Ban and deletion are equivalent to `markChatItemsByAuthorAsDeletedAction` and `m
 Set `keep_default_na` to `False` and `na_values` to `''` in `read_csv`. Otherwise, chat message like `NA` would incorrectly be treated as NaN value.
 
 ```python
-chats = pd.read_csv('../input/vtuber-livechat/chats_2021-03.csv',
+import pandas as pd
+from glob import iglob
+
+flagged = pd.concat([
+pd.read_csv(f,
 na_values='',
-keep_default_na=False,
-index_col='timestamp',
-parse_dates=True)
+keep_default_na=False)
+for f in iglob('../input/sensai/chats_flagged_*.csv')
+],
+ignore_index=True)
 ```
 
-### Channels (`channels.csv`)
-
-| column | type | description |
-| ----------------- | --------------- | ---------------------- |
-| channelId | string | channel id |
-| name | string | channel name |
-| englishName | nullable string | channel name (English) |
-| affiliation | string | channel affiliation |
-| group | nullable string | group |
-| subscriptionCount | number | subscription count |
-| videoCount | number | uploads count |
-| photo | string | channel icon |
-
-Inactive channels have `INACTIVE` in `group` column.
-
 ## Consideration
 
 ### Anonymization
 
-`id` and `channelId` are anonymized by SHA-1 hashing algorithm with a pinch of undisclosed salt.
+`authorChannelId` are anonymized by SHA-1 hashing algorithm with a pinch of undisclosed salt.
 
 ### Handling Custom Emojis
 
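Note on the Anonymization line changed above: the README states only that `authorChannelId` is hashed with SHA-1 and an undisclosed salt. The following is a minimal sketch of what such a scheme could look like, not the dataset's actual procedure. The salt value, the `anonymize` helper name, the placeholder ids, and the salt-then-id concatenation order are all assumptions made for illustration.

```python
import hashlib

# Hypothetical salt: the dataset's real salt is undisclosed.
SALT = b"example-salt"


def anonymize(author_channel_id: str, salt: bytes = SALT) -> str:
    """Return the hex SHA-1 digest of salt + id (assumed concatenation order)."""
    return hashlib.sha1(salt + author_channel_id.encode("utf-8")).hexdigest()


# Placeholder ids, not real channel ids.
a = anonymize("UC_example_channel_1")
b = anonymize("UC_example_channel_2")
```

Under these assumptions, the same id always maps to the same digest, so rows written by one author can still be grouped, while a different salt yields unrelated digests.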
@ -97,7 +84,7 @@ All custom emojis are replaced with a Unicode replacement character `U+FFFD`.
 ```latex
 @misc{sensai-dataset,
 author={Yasuaki Uechi},
-title={Sensai: Large Scale Virtual YouTubers Live Chat Dataset},
+title={Sensai: Toxic Chat Dataset},
 year={2021},
 month={8},
 version={31},