Multilingual Toxic Comment Classification



The problem statement:

• Train on an English toxic comments dataset and use a multilingual model to detect toxicity in comments written in other languages
• In simple terms, rather than just flagging toxic words, the model had to "understand the essence" of the sentence along with its context

What we did:

• Data exploration and feature mapping. Unfortunately, there wasn't much that could be done here because the dataset was limiting. So we:
• Created our own datasets via translation models.
• Used language-specific models on the translated versions of the English dataset and blended the results over many rounds of hyperparameter tuning.
• Used a wide variety of models to preserve generality. One blend even included a scikit-learn SVM model!
• Weeks and weeks of training, validation and testing.
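
The blending step above can be illustrated with a small sketch. This is not the team's actual pipeline: the grid search, the ROC AUC scoring, the function names, and the toy numbers are all assumptions made for illustration. It only shows the general idea of averaging several models' toxicity probabilities and picking the weights that score best on a validation set.

```python
from itertools import product


def blend(predictions, weights):
    """Weighted average of per-model probability lists (one list per model)."""
    total = sum(weights)
    return [
        sum(w * p for w, p in zip(weights, row)) / total
        for row in zip(*predictions)
    ]


def roc_auc(labels, scores):
    """Chance that a random toxic comment outscores a random clean one."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    # Ties count as half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def search_weights(labels, predictions, steps=5):
    """Try every weight combination on a coarse grid and keep the best one."""
    grid = [i / (steps - 1) for i in range(steps)]
    best_auc, best_weights = -1.0, None
    for weights in product(grid, repeat=len(predictions)):
        if sum(weights) == 0:
            continue  # an all-zero blend is undefined
        auc = roc_auc(labels, blend(predictions, weights))
        if auc > best_auc:
            best_auc, best_weights = auc, weights
    return best_weights, best_auc


# Toy validation data, made up for illustration only.
val_labels = [1, 0, 1, 0, 1, 0]
model_a = [0.9, 0.2, 0.4, 0.5, 0.8, 0.1]  # hypothetical model A probabilities
model_b = [0.6, 0.3, 0.7, 0.2, 0.5, 0.4]  # hypothetical model B probabilities

weights, auc = search_weights(val_labels, [model_a, model_b])
print(weights, round(auc, 3))
```

Because the grid includes the weightings that use a single model on its own, the blend found this way never scores worse on the validation set than the best individual model. Whether that gain carries over to unseen data is a separate question.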

Results:

• Our team, Sentimental, was awarded a silver medal 🥈 as we ranked 14th among 1600+ teams!
• My profile on Kaggle was promoted to Competitions Expert.