  Published Paper Details:

  Paper Title

Fine-Tuning Of Distilbert For Gujarati-English Code-Mixed Language Identification In Resource Constrained Environment

  Authors

  Chirag D. Shah,  Dr. Shailesh A. Chaudhari

  Keywords

DistilBERT, NLP, LID

  Abstract


The growing use of code-mixed language, mainly on social media platforms, poses challenges for natural language processing (NLP) tasks because of its irregular writing patterns and the use of multiple languages within a single sentence or phrase. Gujarati-English code-mixing is increasingly common in multilingual communities, as the Gujarati diaspora around the world switches between its mother tongue and English when commenting, tweeting, or posting. In this paper, we present an efficient solution for word-level language identification (LID) in low-resource scenarios by fine-tuning DistilBERT, a lightweight transformer-based model. Our dataset consists of code-mixed social media comments from YouTube, with each word annotated with one of three language tags: Gujarati, English, or Other. It comprises 77,761 annotated sentences containing 732,917 words, with language labels distributed as 56.06% Gujarati (GJ), 36.77% English (EN), and 7.10% Other (OT). The distinctive part of this work is its fine-tuning process, which is conducted entirely on CPU by dividing the training data into chunks of 1000 sentences each. This chunk-based training allows the large dataset to be processed incrementally while preserving the optimizer and scheduler states across iterations. The proposed model achieved an accuracy of 97.09%, precision of 97.02%, recall of 97.09%, and F1-score of 97.01%. These results outperform our baseline, a machine-learning Random Forest model trained on hand-crafted features, which achieved an accuracy of 91.2%. This demonstrates the effectiveness of transformer-based fine-tuning for language identification in code-mixed contexts.
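
The chunk-based procedure described in the abstract can be illustrated with a minimal sketch. This is not the authors' code. It assumes a toy one-parameter regression model in place of DistilBERT, SGD with momentum as the optimizer, a linear learning-rate decay as the scheduler, and a JSON file as the checkpoint; the names `iter_chunks`, `train_on_chunk`, and `train_in_chunks` are hypothetical. In a PyTorch setup, the analogous state would presumably come from the `state_dict()` methods of the model, optimizer, and scheduler. The point of the sketch is that the optimizer state (`velocity`) and the scheduler position (`step`) are saved after each chunk and restored before the next, so training in chunks gives the same result as one uninterrupted pass.

```python
import json
import os
import tempfile
from typing import Dict, Iterator, Sequence, Tuple

# Toy stand-in for a tokenized, labeled sentence: (feature, target).
Example = Tuple[float, float]


def iter_chunks(data: Sequence[Example], chunk_size: int) -> Iterator[Sequence[Example]]:
    """Yield consecutive slices of at most chunk_size examples."""
    for start in range(0, len(data), chunk_size):
        yield data[start:start + chunk_size]


def new_state() -> Dict[str, float]:
    # weight: model parameter; velocity: optimizer (momentum) state;
    # step: scheduler position. All three must survive between chunks.
    return {"weight": 0.0, "velocity": 0.0, "step": 0}


def train_on_chunk(state: Dict[str, float], chunk: Sequence[Example],
                   total_steps: int, base_lr: float = 0.05,
                   momentum: float = 0.9) -> Dict[str, float]:
    """Run SGD with momentum and linear LR decay over one chunk."""
    w, v, step = state["weight"], state["velocity"], int(state["step"])
    for x, y in chunk:
        # The schedule depends on the global step, not the position in the chunk.
        lr = base_lr * max(0.0, 1.0 - step / total_steps)
        grad = 2.0 * (w * x - y) * x  # derivative of (w*x - y)**2 w.r.t. w
        v = momentum * v + grad
        w -= lr * v
        step += 1
    return {"weight": w, "velocity": v, "step": step}


def train_in_chunks(data: Sequence[Example], chunk_size: int,
                    checkpoint_path: str) -> Dict[str, float]:
    """Train chunk by chunk, reloading the saved state before each chunk."""
    total_steps = len(data)
    state = new_state()
    for chunk in iter_chunks(data, chunk_size):
        # Resume from the previous chunk's checkpoint if one exists.
        if os.path.exists(checkpoint_path):
            with open(checkpoint_path) as fh:
                state = json.load(fh)
        state = train_on_chunk(state, chunk, total_steps)
        with open(checkpoint_path, "w") as fh:
            json.dump(state, fh)
    return state


if __name__ == "__main__":
    # Synthetic data with target = 3 * feature; chunk size 1000 as in the abstract.
    data = [((i % 10) / 10 + 0.1, 3.0 * ((i % 10) / 10 + 0.1)) for i in range(2500)]
    with tempfile.TemporaryDirectory() as tmp:
        final = train_in_chunks(data, 1000, os.path.join(tmp, "state.json"))
    print(final)
```

If only the model weights were saved between chunks, each chunk would restart the learning-rate schedule and reset the momentum buffer, and the chunked run would no longer match a single continuous run.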

  IJCRT's Publication Details

  Unique Identification Number - IJCRT2512826

  Paper ID - 299431

  Page Number(s) - h291-h298

  Published in - Volume 13 | Issue 12 | December 2025

  DOI (Digital Object Identifier) -    https://doi.org/10.56975/ijcrt.v13i12.299431

  Publisher Name - IJCRT | www.ijcrt.org | ISSN : 2320-2882

  E-ISSN Number - 2320-2882

  Cite this article

  Chirag D. Shah, Dr. Shailesh A. Chaudhari, "Fine-Tuning Of Distilbert For Gujarati-English Code-Mixed Language Identification In Resource Constrained Environment", International Journal of Creative Research Thoughts (IJCRT), ISSN: 2320-2882, Volume 13, Issue 12, pp. h291-h298, December 2025. Available at: http://www.ijcrt.org/papers/IJCRT2512826.pdf

ISSN and 7.97 Impact Factor Details
ISSN: 2320-2882
Impact Factor: 7.97 and ISSN APPROVED
Journal Starting Year (ESTD) : 2013