Creating a chatbot that can understand and use a language other than English can be an ambitious task. Chatbots are still in their early days, and even though there are many NLP libraries available nowadays, most of them support only the English language. With everything from stopword lists and POS taggers to pretrained word2vec models potentially missing, working on an NLP problem in a different language can be time-consuming.
We at SmartCat tried a somewhat different approach in creating a bot that uses the Serbian language. Using a large dataset of unlabeled sentences written in Serbian and applying ML methods, we were able to create a chatbot that performed very decently. It was able to return an expected response about 9 times out of 10. Interested in how we did it? Keep on reading.
As with any other NLP problem, we started by processing our sentences. The processing steps included converting text to lowercase, removing irrelevant numbers, removing punctuation and stripping extra whitespace. All of these steps can be applied to text written in any language.
If we had been working with sentences written in English, we would have continued with stemming, lemmatization, and removing sparse terms and stopwords, but this got us thinking. Knowing that stopwords occur very often in sentences (English examples would be: a, an, the, such, as…), we decided to compute the tf-idf statistic on our dataset. That way we created a large vocabulary of Serbian words, each with a tf-idf value assigned to it. Words with a lower tf-idf value were considered irrelevant - either stopwords or sparse terms. Words with a higher tf-idf value were considered important.
Stemming and lemmatization were not performed for two reasons: 1. the Serbian language has many suffixes, many rules and even more exceptions for when a suffix can be applied to a word stem; 2. our dataset contains a few dialects and many grammar errors, which could lead to wrong conclusions. We will iron out this issue in further steps.
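The processing and tf-idf steps above can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the exact code behind the bot. Several things here are assumptions: the function names `preprocess` and `corpus_tfidf` are made up for this example, all digits are removed (the text above only removed numbers that were irrelevant), and the per-word value is the sum of tf-idf over all sentences. The text does not say how the values were aggregated; summing is one choice under which both very common and very rare words end up with low values.

```python
import math
import re
from collections import Counter, defaultdict


def preprocess(text):
    """Lowercase, drop digits and punctuation, collapse whitespace."""
    text = text.lower()
    text = re.sub(r"\d+", " ", text)       # assumption: every number is irrelevant
    text = re.sub(r"[^\w\s]", " ", text)   # \w is Unicode-aware, so š, đ, č, ć, ž survive
    text = text.replace("_", " ")          # underscore counts as \w, remove it separately
    return re.sub(r"\s+", " ", text).strip()


def corpus_tfidf(sentences):
    """Return {word: tf-idf summed over all sentences} (aggregation is an assumption)."""
    docs = [preprocess(s).split() for s in sentences]
    n_docs = len(docs)

    # Document frequency: in how many sentences does each word occur?
    df = Counter()
    for doc in docs:
        df.update(set(doc))

    totals = defaultdict(float)
    for doc in docs:
        for word, count in Counter(doc).items():
            tf = count / len(doc)
            idf = math.log(n_docs / df[word])  # 0 for words present in every sentence
            totals[word] += tf * idf
    return dict(totals)


if __name__ == "__main__":
    sample = [
        "Da li je internet spor?",
        "Da li je račun plaćen?",
        "Da li je internet u kvaru?",
    ]
    for word, score in sorted(corpus_tfidf(sample).items(), key=lambda kv: kv[1]):
        print(f"{word:10s} {score:.3f}")
```

Words that appear in every sentence of the toy sample get a value of zero, which is the behaviour one would want from stopword candidates. Where to draw the line between "low" and "high" values is a tuning decision that this sketch leaves open.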
Alright, now we have some processed sentences to work with. Usually we would continue with POS tagging so we can learn something about semantics. Unfortunately, there are no POS taggers available for the Serbian language, and creating one would take too much of our time. That is why we opted for creating a word2vec model. Even though our dataset wasn’t perfect, we were positive that it would give us some insight into word similarity and relations. If you are interested in learning more about how word2vec can do this, you can check out our blog post.
We trained our model on around 400,000 sentences, created word embeddings and then clustered the resulting embeddings using K-Means. It worked like a charm - it was clear that there was a cluster that contained greeting phrases, one that contained personal information such as names or ID numbers, and others that contained gratitude phrases, complaints, problem reports etc. After tuning our algorithm and creating subclusters where needed, we were able to label our large dataset in no time. Now we were aware of the semantics, which was what we needed to generate adequate responses.
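To make the clustering step more concrete, here is a small, self-contained sketch of K-Means over embedding vectors. It is only an illustration under stated assumptions: the two-dimensional vectors in the example are invented toy values standing in for real word2vec embeddings, and the `kmeans` function is a plain standard-library implementation written for this example. In practice the vectors would come from a trained word2vec model and one would normally use a library implementation of K-Means; the text above does not say which tools were used.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, iterations=20, seed=0):
    """Plain K-Means: returns (labels, centroids) for a list of vectors."""
    rng = random.Random(seed)
    # Initialise centroids with k distinct input vectors.
    centroids = [list(v) for v in rng.sample(vectors, k)]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each vector goes to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels, centroids


if __name__ == "__main__":
    # Toy "embeddings" (invented values, not real word2vec output).
    embeddings = {
        "zdravo": [0.9, 0.1],
        "pozdrav": [0.8, 0.2],
        "hvala": [0.1, 0.9],
        "zahvaljujem": [0.2, 0.8],
    }
    words = list(embeddings)
    labels, _ = kmeans([embeddings[w] for w in words], k=2)
    for word, label in zip(words, labels):
        print(label, word)
```

With the toy vectors, the two greeting words end up in one cluster and the two gratitude words in the other. Inspecting the members of each cluster and giving the cluster a name is the manual part of the labeling described above.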
Creating a ChatBot and results
We used the clustering algorithm only as a tool for labeling the data, as we believed that a classification model trained on our labeled dataset would perform better over time. In our tests, random forest achieved the best accuracy. Here is an example of a conversation with the bot:
Bot: Hello! Thank you for contacting us. How can I help you?
User: why do you have frequent issues with your website
Bot: We will check immediately with a competent service so we can give you the right information.
User: ok, thank you
Bot: Thank you for contacting us. If you have any further questions, we will be pleased to answer.
Whenever a new message was received, the chatbot would process it, classify it, and return the response assigned to the predicted class, provided the prediction confidence was higher than the threshold we set. We tested our bot on 100 previously conducted conversations, which contained 210 user messages in total. Out of those 210, 23 messages could not be handled by the bot and 11 messages were not handled correctly. That means the bot responded correctly to 84% of the messages (176 out of 210), and 95% of the messages (199 out of 210) were not handled incorrectly.
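The thresholded response logic can be sketched as below. This is a hedged illustration rather than the bot's actual code: `choose_response`, the class names, the probabilities and the threshold value of 0.6 are all hypothetical, and the "prediction confidence" is assumed to be the probability the classifier assigns to its top class (for example, the kind of per-class probabilities a random forest can output).

```python
def choose_response(class_probabilities, responses, threshold=0.6):
    """Return the canned response for the most probable class,
    or None when the classifier is not confident enough.

    class_probabilities: {class_name: probability}, as a classifier might produce
    responses:           {class_name: response text}
    threshold:           hypothetical cut-off; the real value is not given
    """
    if not class_probabilities:
        return None
    label, confidence = max(class_probabilities.items(), key=lambda item: item[1])
    if confidence <= threshold:
        return None  # the message is left unhandled rather than answered wrongly
    return responses.get(label)


if __name__ == "__main__":
    responses = {
        "greeting": "Hello! Thank you for contacting us. How can I help you?",
        "gratitude": "Thank you for contacting us. If you have any further "
                     "questions, we will be pleased to answer.",
    }
    print(choose_response({"greeting": 0.9, "complaint": 0.1}, responses))
    print(choose_response({"greeting": 0.4, "complaint": 0.35, "gratitude": 0.25}, responses))
```

Returning `None` below the threshold corresponds to the messages that "could not be handled": raising the threshold trades more unhandled messages for fewer wrong answers.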
We were able to achieve even better results using n-grams. We split every received message into n-grams, where n ranges from 1 to the number of words in the message. That way we were able to detect multiple classes in a single message and give a better response. For example, a sentence could contain both a greeting phrase and a question. This approach does require more time to process each message, but it could be suitable for use cases where an immediate response is not of great importance.
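The n-gram idea can be illustrated with the short sketch below. It makes several assumptions: n-grams are taken to be contiguous word sequences, `toy_classify` is a made-up stand-in for the trained classifier (a simple phrase lookup), and the class names and threshold are hypothetical. Note that a message of m words produces m(m+1)/2 n-grams, which is where the extra processing time comes from.

```python
def all_ngrams(message):
    """All contiguous word n-grams of a message, for n = 1 .. number of words."""
    words = message.split()
    return [
        " ".join(words[i:i + n])
        for n in range(1, len(words) + 1)
        for i in range(len(words) - n + 1)
    ]


def detect_classes(message, classify, threshold=0.6):
    """Classify every n-gram and collect the distinct confident classes, in order."""
    detected = []
    for ngram in all_ngrams(message):
        label, confidence = classify(ngram)
        if confidence > threshold and label not in detected:
            detected.append(label)
    return detected


def toy_classify(text):
    """Hypothetical stand-in for a trained classifier: exact phrase lookup."""
    known = {"hello": "greeting", "website is down": "problem_report"}
    if text in known:
        return known[text], 1.0
    return "unknown", 0.0


if __name__ == "__main__":
    print(all_ngrams("ok thank you"))
    print(detect_classes("hello your website is down", toy_classify))
```

In the second call, the message as a whole matches neither known phrase, but its n-grams reveal both a greeting and a problem report, so a response could address both.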
We hope we gave you an idea of how to handle your own non-English-speaking bot. Have you tried any other approach? Let us know in the comments; we would be happy to discuss.