Multilingual Toxic Comment Classification
The problem statement:
• Train on an English toxic comments dataset and use a multilingual model to test the toxicity of comments in other languages
• In simple terms, rather than just flagging toxic words, the model had to "understand the essence" of the sentence along with the context
• Data exploration and feature mapping: unfortunately, there wasn't much that could be done here because the dataset was limiting, so we
• Created our own datasets via translation models.
• We used language-specific models on the translated versions of the English dataset and blended the results over many rounds of hyperparameter tuning
• Used a wide variety of models to preserve generality. One blend even included a scikit-learn SVM model!
• Weeks and weeks of training, validation and testing.
• Our team, Sentimental, was awarded a silver medal 🥈, ranking 14th among 1600+ teams!
• My profile on Kaggle was promoted to Competitions Expert
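
The blending step mentioned above can be illustrated with a minimal sketch. It assumes each model outputs one toxicity probability per comment and that the blend is a simple weighted average; the function name, the scores, and the weights are all hypothetical, and the actual blends used in the competition may have been built differently.

```python
def blend_predictions(predictions, weights):
    """Return the weighted average of per-model toxicity probabilities.

    predictions: one list of scores per model, all for the same comments.
    weights: one non-negative weight per model (hypothetical values here).
    """
    if len(predictions) != len(weights):
        raise ValueError("need exactly one weight per model")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    n_comments = len(predictions[0])
    if any(len(p) != n_comments for p in predictions):
        raise ValueError("every model must score the same comments")
    # Normalise by the weight total so the result stays in [0, 1].
    return [
        sum(w * p[i] for w, p in zip(weights, predictions)) / total
        for i in range(n_comments)
    ]


# Made-up scores from three models for the same four comments.
multilingual_scores = [0.90, 0.10, 0.60, 0.20]
language_specific_scores = [0.80, 0.20, 0.70, 0.10]
svm_scores = [0.70, 0.30, 0.40, 0.30]

blended = blend_predictions(
    [multilingual_scores, language_specific_scores, svm_scores],
    weights=[0.5, 0.3, 0.2],
)
print([round(score, 2) for score in blended])
```

In a setup like this, the weights would be one of the things tuned against a validation set across rounds.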