This repo includes a file to preprocess your data. Unfortunately it is extra spaghetti code, but decently usable.
The data I used for train data can be found here Unfortunately, the discord (validation) data used was from various servers, and I was asked not to release it.
- Your text file(s) are assumed to be person to person conversation, with line 1 being the first person and line 2 being the second person.
- Lines end with
\n
. - There is only one personality to follow.
- You have a validation set & a training set, in seperate text files.
In avg_length.py
:
- Change the file read in line
2
to each of your 2 text files - Estimate the average length and standard devation of sentences in your dataset. Try and normalize the data. A generally good formula is
(Average + Standard Deviation) + 5
In preprocess.py
:
- On lines
38
and58
change the max length you found before fromavg_length.py
- On lines
70
adn84
change the personality to use, unless you're ok withjade
being the personality - On lines
10
and14
, manually modify the text files to read from.inp.txt
contains input data that the bot should attempt to mimic (your goal personalitites)out.txt
is the validation data. Preferrably, it comes from the audience the chatbot will be interacting with. Some of this data will e included in the training set.
- On lines
16-17
, change the number of lines to read.
- Each entry is a dict with two keys personality and utterances, the dataset is a list of entries. In each entry, there is a
- personality: list of strings containing the personality of the agent. This data is not interpreted, and serves primarily as a "marker". As such, multiple of the same key can be trained on to improve the overall model.
- utterances: list of dictionaries, each of which has two keys which are lists of strings.
- candidates: [next_utterance_candidate_1, ..., next_utterance_candidate_19] The last candidate is the ground truth response observed in the conversational data
- history: [dialog_turn_0, ... dialog_turn N], where N is an odd number since the other user starts every conversation.
- Preprocessing:
- Spaces before periods/punctuation at end of sentences
- Everything lowercase
- All data should be
a-z, A-Z, 0-9 ".", "?", "!", ",", "'", " ", "/", "%", "#"
- Keep the validation set small!! the training script will go through all the validation samples before saving the model.