sensai/notebooks/playground.ipynb


In [1]:
from datasets import load_dataset, Features
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments
import os
from os.path import join
import pandas as pd
from datasets import ClassLabel, Value

# https://huggingface.co/docs/datasets/loading_datasets.html

DATASET_DIR = os.environ['DATASET_DIR']
In [2]:
dataset = load_dataset("holodata/sensai", features=Features(
                {
                    "body": Value("string"),
                    "toxic": ClassLabel(num_classes=2, names=['0', '1'])
                }
            ))
dataset = dataset['train']
Using custom data configuration sensai-4d9ed81389161083
Reusing dataset parquet (/home/uetchy/.cache/huggingface/datasets/parquet/sensai-4d9ed81389161083/0.0.0/9296ce43568b20d72ff8ff8ecbc821a16b68e9b8b7058805ef11f06e035f911a)
In [3]:
dataset.features
Out[3]:
{'body': Value(dtype='string', id=None),
 'toxic': ClassLabel(num_classes=2, names=['0', '1'], names_file=None, id=None)}
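The label column is a two-class ClassLabel. A quick sanity check, not part of the original run, to eyeball one example and the label balance (the column name is taken from the features above):

In [ ]:
from collections import Counter

# Look at one row and count how many rows fall into each class
# (materializes the full column; fine for a quick check, slow for very large datasets)
print(dataset[0])
print(Counter(dataset["toxic"]))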
In [4]:
model = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=2)
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.predictions.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
In [5]:
# Work with a random 50k subset for this playground run
samples = dataset.shuffle().select(range(50000))

def tokenize_function(examples):
    # Tokenize the chat message body, padding/truncating to the model's max length
    return tokenizer(examples["body"], padding="max_length", truncation=True)

tokenized_datasets = samples.map(tokenize_function, batched=True)
tokenized_datasets.rename_column_("toxic", "label")
Loading cached shuffled indices for dataset at /home/uetchy/.cache/huggingface/datasets/parquet/sensai-4d9ed81389161083/0.0.0/9296ce43568b20d72ff8ff8ecbc821a16b68e9b8b7058805ef11f06e035f911a/cache-24e3dd769ef2f1b7.arrow
Loading cached processed dataset at /home/uetchy/.cache/huggingface/datasets/parquet/sensai-4d9ed81389161083/0.0.0/9296ce43568b20d72ff8ff8ecbc821a16b68e9b8b7058805ef11f06e035f911a/cache-8395f066c72e57d7.arrow
/tmp/ipykernel_4082765/2982913603.py:7: FutureWarning: rename_column_ is deprecated and will be removed in the next major version of datasets. Use Dataset.rename_column instead.
  tokenized_datasets.rename_column_("toxic", "label")
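The in-place `rename_column_` still works here but is deprecated, as the warning notes. A minimal sketch of the non-deprecated equivalent, which returns a new dataset instead of mutating in place:

In [ ]:
# Preferred API per the FutureWarning above
tokenized_datasets = tokenized_datasets.rename_column("toxic", "label")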
In [6]:
splitset = tokenized_datasets.train_test_split(0.2)
splitset
Out[6]:
DatasetDict({
    train: Dataset({
        features: ['attention_mask', 'body', 'input_ids', 'token_type_ids', 'label'],
        num_rows: 40000
    })
    test: Dataset({
        features: ['attention_mask', 'body', 'input_ids', 'token_type_ids', 'label'],
        num_rows: 10000
    })
})
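The raw `body` text column is still carried along; the Trainer simply ignores it (see the log in the training cell below). If you prefer to drop it up front, a sketch assuming a datasets version where `remove_columns` returns a copy:

In [ ]:
# Drop the raw text so only tokenized inputs and labels remain
splitset = splitset.remove_columns(["body"])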
In [7]:
training_args = TrainingArguments("test_trainer")
trainer = Trainer(
    model=model, args=training_args, train_dataset=splitset['train'], eval_dataset=splitset['test']
)
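This Trainer is configured without evaluation, so training only reports loss. A hedged sketch of what an accuracy-reporting setup could look like; the `compute_metrics` function and the `evaluation_strategy="epoch"` setting are additions, not part of the original run:

In [ ]:
import numpy as np

def compute_metrics(eval_pred):
    # eval_pred is a (logits, labels) pair supplied by the Trainer
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return {"accuracy": (predictions == labels).mean()}

training_args = TrainingArguments("test_trainer", evaluation_strategy="epoch")
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=splitset["train"],
    eval_dataset=splitset["test"],
    compute_metrics=compute_metrics,
)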
In [9]:
trainer.train(resume_from_checkpoint=True)
Loading model from test_trainer/checkpoint-12500.
The following columns in the training set don't have a corresponding argument in `BertForSequenceClassification.forward` and have been ignored: body.
***** Running training *****
  Num examples = 40000
  Num Epochs = 3
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 15000
  Continuing training from checkpoint, will skip to saved global_step
  Continuing training from epoch 2
  Continuing training from global step 12500
  Will skip the first 2 epochs then the first 2500 batches in the first epoch. If this takes a lot of time, you can add the `--ignore_data_skip` flag to your launch command, but you will resume the training on data already seen by your model.
Didn't find an RNG file, if you are resuming a training that was launched in a distributed fashion, reproducibility is not guaranteed.
[15000/15000 31:45, Epoch 3/3]

 Step   Training Loss
13000        0.687500
13500        0.686300
14000        0.637900
14500        0.643200
15000        0.627700

Saving model checkpoint to test_trainer/checkpoint-13000
Configuration saved in test_trainer/checkpoint-13000/config.json
Model weights saved in test_trainer/checkpoint-13000/pytorch_model.bin
Saving model checkpoint to test_trainer/checkpoint-13500
Configuration saved in test_trainer/checkpoint-13500/config.json
Model weights saved in test_trainer/checkpoint-13500/pytorch_model.bin
Saving model checkpoint to test_trainer/checkpoint-14000
Configuration saved in test_trainer/checkpoint-14000/config.json
Model weights saved in test_trainer/checkpoint-14000/pytorch_model.bin
Saving model checkpoint to test_trainer/checkpoint-14500
Configuration saved in test_trainer/checkpoint-14500/config.json
Model weights saved in test_trainer/checkpoint-14500/pytorch_model.bin
Saving model checkpoint to test_trainer/checkpoint-15000
Configuration saved in test_trainer/checkpoint-15000/config.json
Model weights saved in test_trainer/checkpoint-15000/pytorch_model.bin


Training completed. Do not forget to share your model on huggingface.co/models =)


Out[9]:
TrainOutput(global_step=15000, training_loss=0.10941998901367188, metrics={'train_runtime': 1918.0916, 'train_samples_per_second': 62.562, 'train_steps_per_second': 7.82, 'total_flos': 3.24994775580672e+16, 'train_loss': 0.10941998901367188, 'epoch': 3.0})
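With training finished, the held-out split can be scored and the fine-tuned model tried on a single message. A minimal sketch (the example text is made up; without a `compute_metrics` function, `evaluate()` only reports loss):

In [ ]:
import torch

trainer.evaluate()  # eval_loss on splitset['test']

model.eval()
text = "example chat message"  # hypothetical input
inputs = tokenizer(text, return_tensors="pt", truncation=True).to(model.device)
with torch.no_grad():
    logits = model(**inputs).logits
pred = logits.argmax(dim=-1).item()
print(dataset.features["toxic"].int2str(pred))  # '0' or '1'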
In [ ]: