From fe6fed0759d4f044b21143c015dd28ea1e570242 Mon Sep 17 00:00:00 2001
From: Yasuaki Uechi
Date: Wed, 1 Sep 2021 12:18:29 +0900
Subject: [PATCH] chore: fix docs

---
 LICENSE   | 19 +++++++++++++++++++
 README.md | 37 ++++++++++++-------------------------
 2 files changed, 31 insertions(+), 25 deletions(-)
 create mode 100644 LICENSE

diff --git a/LICENSE b/LICENSE
new file mode 100644
index 0000000..d175da2
--- /dev/null
+++ b/LICENSE
@@ -0,0 +1,19 @@
+Copyright (c) 2021 Yasuaki Uechi
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
+IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM,
+DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR
+OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE
+OR OTHER DEALINGS IN THE SOFTWARE.
diff --git a/README.md b/README.md
index 10fee6d..e367a2e 100644
--- a/README.md
+++ b/README.md
@@ -34,12 +34,9 @@ Ban and deletion are equivalent to `markChatItemsByAuthorAsDeletedAction` and `m
 
 | column          | type   | description                  |
 | --------------- | ------ | ---------------------------- |
-| timestamp       | string | UTC timestamp                |
 | body            | string | chat message                 |
 | membership      | string | membership status            |
-| id              | string | anonymized chat id           |
 | authorChannelId | string | anonymized author channel id |
-| videoId         | string | source video id              |
 | channelId       | string | source channel id            |
 
 #### Membership status
@@ -60,33 +57,23 @@ Ban and deletion are equivalent to `markChatItemsByAuthorAsDeletedAction` and `m
 Set `keep_default_na` to `False` and `na_values` to `''` in `read_csv`. Otherwise, chat message like `NA` would incorrectly be treated as NaN value.
 
 ```python
-chats = pd.read_csv('../input/vtuber-livechat/chats_2021-03.csv',
-                    na_values='',
-                    keep_default_na=False,
-                    index_col='timestamp',
-                    parse_dates=True)
+import pandas as pd
+from glob import iglob
+
+flagged = pd.concat([
+    pd.read_csv(f,
+                na_values='',
+                keep_default_na=False)
+    for f in iglob('../input/sensai/chats_flagged_*.csv')
+],
+                    ignore_index=True)
 ```
 
-### Channels (`channels.csv`)
-
-| column            | type            | description            |
-| ----------------- | --------------- | ---------------------- |
-| channelId         | string          | channel id             |
-| name              | string          | channel name           |
-| englishName       | nullable string | channel name (English) |
-| affiliation       | string          | channel affiliation    |
-| group             | nullable string | group                  |
-| subscriptionCount | number          | subscription count     |
-| videoCount        | number          | uploads count          |
-| photo             | string          | channel icon           |
-
-Inactive channels have `INACTIVE` in `group` column.
-
 ## Consideration
 
 ### Anonymization
 
-`id` and `channelId` are anonymized by SHA-1 hashing algorithm with a pinch of undisclosed salt.
+`authorChannelId` are anonymized by SHA-1 hashing algorithm with a pinch of undisclosed salt.
 
 ### Handling Custom Emojis
 
@@ -97,7 +84,7 @@ All custom emojis are replaced with a Unicode replacement character `U+FFFD`.
 ```latex
 @misc{sensai-dataset,
   author={Yasuaki Uechi},
-  title={Sensai: Large Scale Virtual YouTubers Live Chat Dataset},
+  title={Sensai: Toxic Chat Dataset},
   year={2021},
   month={8},
   version={31},