MarAI is a conversational AI designed specifically for the Marathi language. It combines Rasa for intent recognition and dialogue management with fallback mechanisms, including TF-IDF-based retrieval and Google Search integration.
- Rasa-powered Conversations: Primary dialogue management using Rasa.
- TF-IDF Fallback: Corpus-based retrieval for handling out-of-scope queries.
- googlesearch-python Integration: Content retrieval focused on Marathi Wikipedia.
- Response Paraphrasing: Natural-sounding responses using multilingual paraphrase generation.
- React-based Frontend: A sweet user interface.
Important Note: MarAI is not a transformer-based large language model (LLM). Instead, it combines rule-based systems, machine-learning techniques, and retrieval-based methods to produce responses. While it can hold Marathi conversations and provide information, its capabilities are different from, and more limited than, those of large language models such as GPT-3. As a result, its responses are not always accurate or contextually appropriate.
Future Plans: We initially considered implementing a quantized text generation model, specifically fine-tuned on a Marathi corpus. However, due to hardware limitations, time constraints, and resource considerations, we opted for our current approach. In the future, we plan to enhance MarAI by replacing the current fallback mechanism with an LLM for Marathi, such as Misal. This upgrade would significantly improve the system's response generation capabilities and contextual understanding.
- User input is first processed by the Marathi Rasa model, specifically trained for handling basic, day-to-day conversations in Marathi.
- If Rasa fails to handle the query, the TF-IDF based retriever searches the local corpus.
- If relevant content is found in the local corpus, it is returned to the user.
- If no suitable content is found, a Google search is performed, filtering for Marathi Wikipedia articles.
- The retrieved content is saved to the corpus for future use.
- Responses from the Marathi Rasa model are paraphrased using AI4Bharat's MultiIndicParaphraseGeneration model for natural-sounding output.
Note: Responses from the TF-IDF fallback mechanism are currently NOT rephrased. You can modify the `actions.py` file if you want to rephrase the content fetched from Wikipedia or the local corpus.
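The fallback chain described above can be sketched as follows. This is a minimal illustration of the routing logic only; the function names and handler signatures are hypothetical, not the actual ones in the project's `actions.py`:

```python
def respond(query, rasa_handler, tfidf_retriever, google_search, corpus):
    """Route a user query through the fallback chain: Rasa -> TF-IDF -> Google."""
    # 1. Try the Rasa model first (handles basic, day-to-day conversation).
    answer = rasa_handler(query)
    if answer is not None:
        return answer
    # 2. Fall back to TF-IDF retrieval over the local corpus.
    answer = tfidf_retriever(query, corpus)
    if answer is not None:
        return answer
    # 3. Last resort: search the web, restricted to Marathi Wikipedia.
    answer = google_search(query + " site:mr.wikipedia.org")
    if answer is not None:
        corpus.append(answer)  # save retrieved content for future queries
    return answer
```

Each stage only runs when every earlier stage declines to answer, which keeps the cheap, high-precision handlers in front of the expensive web search.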
1. Clone the repository:

   ```shell
   git clone https://github.com/ryukaizen/marai.git
   cd marai
   ```

2. Set up the environment:

   ```shell
   docker-compose up --build
   ```

3. Access the application: open your browser and navigate to `http://localhost:3000`.
```
marai
├── marai/                 # Rasa project files
│   ├── actions/           # Custom actions including TF-IDF retrieval
│   ├── data/              # Training data (NLU, stories, rules)
│   └── models/            # Trained Rasa models
├── public/                # Static assets for the React app
├── src/                   # React application source code
└── docker-compose.yml     # Docker composition file
```
To train and run the Rasa model:

1. Train the model:

   ```shell
   rasa train
   ```

2. Test the model in shell mode:

   ```shell
   rasa shell
   ```

3. Run Rasa as an API (for frontend interaction):

   ```shell
   rasa run --enable-api --cors "*"
   ```
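With the API running, the frontend (or any HTTP client) can talk to Rasa's standard REST channel. Below is a minimal sketch using only the Python standard library; the port 5005 is Rasa's default, and the sender ID is an arbitrary example:

```python
import json
from urllib import request


def build_rasa_payload(sender, message):
    """Build the JSON body expected by Rasa's REST webhook."""
    return json.dumps({"sender": sender, "message": message}).encode("utf-8")


def ask_bot(message, url="http://localhost:5005/webhooks/rest/webhook", sender="user1"):
    """POST a message to the Rasa REST channel and return the parsed reply list."""
    req = request.Request(
        url,
        data=build_rasa_payload(sender, message),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.loads(resp.read())
```

Calling `ask_bot("नमस्कार")` against a running server returns a list of response objects, each typically containing a `text` field.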
You can add your own data to the corpus by adding `.txt` files to the `marai/actions/corpora` directory. The retriever will automatically include these files in its knowledge base.
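For intuition about how the retriever scores such files, here is a toy TF-IDF ranker written from scratch with the standard library. This is an illustration of the general technique only; the project's actual retriever lives in `retriever.py` and may differ in tokenization and weighting:

```python
import math
from collections import Counter


def tfidf_rank(query, docs):
    """Rank documents by cosine similarity of TF-IDF vectors against the query."""
    tokenized = [d.lower().split() for d in docs]
    n = len(docs)
    # Document frequency: in how many documents does each term appear?
    df = Counter(t for doc in tokenized for t in set(doc))
    idf = {t: math.log(n / df[t]) + 1.0 for t in df}

    def vec(tokens):
        tf = Counter(tokens)
        # Terms unseen in the corpus get zero weight.
        return {t: tf[t] * idf.get(t, 0.0) for t in tf}

    def cosine(a, b):
        dot = sum(a[t] * b.get(t, 0.0) for t in a)
        na = math.sqrt(sum(v * v for v in a.values()))
        nb = math.sqrt(sum(v * v for v in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    q = vec(query.lower().split())
    scores = [(cosine(q, vec(doc)), i) for i, doc in enumerate(tokenized)]
    return sorted(scores, reverse=True)  # best-matching document index first
```

The top tuple's second element is the index of the best-matching document, and its first element is the similarity score that a threshold like `relevance_score` would be compared against.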
The retrieval mechanism can be fine-tuned by adjusting the conditions in the `is_result_relevant` function in `retriever.py`. Key parameters to consider:

- `relevance_score`: Cosine similarity threshold (currently 0.2)
- `term_match_percentage`: Percentage of query terms that should match the result (currently 0.3)
- `name_similarity`: Fuzzy matching threshold for document names (currently 0.8)
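Combined, these three checks might look like the following sketch. The signature, the use of `difflib` for fuzzy matching, and the OR-combination of the signals are all assumptions for illustration; consult `retriever.py` for the actual logic:

```python
from difflib import SequenceMatcher


def is_result_relevant(query, result_text, doc_name,
                       cosine_sim=0.0,
                       relevance_score=0.2,
                       term_match_percentage=0.3,
                       name_similarity=0.8):
    """Accept a retrieved result if any of three relevance signals fires."""
    # Signal 1: cosine similarity of TF-IDF vectors exceeds the threshold.
    if cosine_sim >= relevance_score:
        return True
    # Signal 2: enough of the query's terms literally appear in the result.
    terms = query.lower().split()
    matched = sum(1 for t in terms if t in result_text.lower())
    if terms and matched / len(terms) >= term_match_percentage:
        return True
    # Signal 3: the document name fuzzily matches the query.
    ratio = SequenceMatcher(None, query.lower(), doc_name.lower()).ratio()
    return ratio >= name_similarity
```

Raising the thresholds makes the fallback stricter (fewer, more precise answers); lowering them makes it more permissive.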
To experiment with AI4Bharat's paraphraser model, refer to the `inference_test.py` file. This script lets you test and evaluate the performance of the paraphrasing functionality.
The `rephrase` function in `actions.py` uses AI4Bharat's model for paraphrasing:
```python
def rephrase(self, message):
    inp = self.tokenizer(
        message + " </s> " + self.lang_id,
        add_special_tokens=False,
        return_tensors="pt",
        padding=True,
    ).input_ids
    model_output = self.model.generate(
        inp,
        use_cache=True,                  # Cache key/value pairs for faster decoding
        no_repeat_ngram_size=2,          # Prevent repeated 2-grams in the output
        encoder_no_repeat_ngram_size=2,  # Prevent repeating 2-grams from the encoder input
        num_beams=2,                     # Beam width; higher = more diverse output but slower
        max_length=30,                   # Maximum length of the generated sequence
        min_length=10,                   # Minimum length of the generated sequence
        early_stopping=True,             # Stop once num_beams finished candidates exist
        pad_token_id=self.pad_id,        # Token ID for padding
        bos_token_id=self.bos_id,        # Token ID for beginning of sentence
        eos_token_id=self.eos_id,        # Token ID for end of sentence
        decoder_start_token_id=self.tokenizer._convert_token_to_id_with_added_voc(self.lang_id),
    )
    return self.tokenizer.decode(
        model_output[0], skip_special_tokens=True, clean_up_tokenization_spaces=False
    )
```
Adjust these parameters to fine-tune the paraphrasing output.
Rasa uses several configuration files to define the behavior of the chatbot:

- `data/nlu.yml`: Training data for the natural language understanding (NLU) model, including example user utterances and their corresponding intents and entities.
- `data/stories.yml`: Conversation flows, showing how the bot should respond to different sequences of user intents.
- `data/rules.yml`: Rules for specific conversation patterns that should always be followed, regardless of the conversation history.
- `domain.yml`: The universe of the chatbot, including intents, entities, slots, actions, and responses.
- `config.yml`: The NLU pipeline and policy ensemble for the Rasa model.
- `endpoints.yml`: URLs for the different endpoints the bot can use.
- `credentials.yml`: Credentials for external services the bot might use.
To customize the bot's behavior, you'll primarily work with `nlu.yml`, `stories.yml`, `rules.yml`, and `domain.yml`.
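As a starting point, a custom intent in `nlu.yml` follows this shape. The intent name and the example utterances below are illustrative, not taken from the project's actual training data:

```yaml
version: "3.1"
nlu:
  - intent: ask_weather          # hypothetical intent name
    examples: |
      - आजचे हवामान कसे आहे?
      - पुण्यात पाऊस पडेल का?
```

For Rasa to use a new intent, it must also be listed under `intents:` in `domain.yml`, with a matching response or story covering it.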
Whether big or small, contributions are always warmly welcomed!
This project is licensed under the MIT License - see the LICENSE file for details.
We gratefully acknowledge the use of AI4Bharat's technologies in this project:
- The MultiIndicParaphraseGeneration model for natural language generation. Please cite the following paper when using or referencing this work:
```bibtex
@inproceedings{Kumar2022IndicNLGSM,
  title  = {IndicNLG Suite: Multilingual Datasets for Diverse NLG Tasks in Indic Languages},
  author = {Aman Kumar and Himani Shrotriya and Prachi Sahu and Raj Dabre and Ratish Puduppully and Anoop Kunchukuttan and Amogh Mishra and Mitesh M. Khapra and Pratyush Kumar},
  year   = {2022},
  url    = {https://arxiv.org/abs/2203.05437}
}
```
- The "@ai4bharat/indic-transliterate" React library, which is used for real-time quick transliteration from English to Marathi on the text input. This feature greatly enhances the user experience for those more comfortable typing in English.
We would also like to thank the smallstep.ai team for giving valuable insights on how to proceed with this project.
लाभले आम्हास भाग्य बोलतो मराठी
जाहलो खरेच धन्य ऐकतो मराठी
धर्म, पंथ, जात एक जाणतो मराठी
एवढ्या जगात माय मानतो मराठी

*(Fortunate are we to speak Marathi; truly blessed are we to hear Marathi; religion, creed, and caste, Marathi knows them as one; in all this world, we regard Marathi as our mother.)*