
Generative AI Chatbot

What is a Generative AI Chatbot?
Jun 19th 2025
A generative AI chatbot is an artificial intelligence system that can engage in human-like conversations by generating text responses dynamically. Unlike rule-based chatbots, which rely on predefined scripts, generative chatbots use large language models (LLMs) like GPT-4, Mistral, or LLaMA to produce contextually relevant answers.
Key Features
  1. Natural Language Understanding (NLU): Interprets user intent.
  2. Context Retention: Remembers conversation history.
  3. Dynamic Responses: Generates unique replies instead of canned responses.
  4. Multimodal Capabilities: Some models (e.g., GPT-4V) can process images as well as text.
Applications
  • Customer Support: Automates FAQs and troubleshooting.
  • Personal Assistants: Schedules tasks, answers queries.
  • Education: Tutors students in various subjects.
  • Entertainment: Role-playing, storytelling, and gaming.
Challenges
  • Hallucinations: May generate incorrect or fabricated information.
  • Bias: Can reflect biases present in training data.
  • Compute Costs: Running LLMs requires significant resources.


How the Chatbot Works
Architecture Overview
This chatbot uses:
  1. Mistral 7B Instruct: A 7-billion-parameter LLM fine-tuned for dialogue.
  2. Hugging Face Transformers: Loads the model and tokenizer.
  3. Gradio: Provides a user-friendly web interface.
Key Components
1. Model Loading
  • The AutoModelForCausalLM and AutoTokenizer classes load Mistral 7B in 16-bit precision (torch.float16) for efficiency.
  • device_map="auto" lets the accelerate library place model layers on the GPU automatically, offloading to CPU if memory runs short.
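A condensed sketch of this loading step (same model as the full script below; the hf_device_map print is only for inspection and is not part of the chatbot):

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "mistralai/Mistral-7B-Instruct-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,  # half precision: roughly half the memory of float32 weights
    device_map="auto",          # accelerate decides the placement of each layer
)
print(model.hf_device_map)      # shows which device each module was assigned to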
2. Chat Template
  • The apply_chat_template method formats messages into Mistral’s expected prompt structure:
    <s>[INST] User message [/INST] Model reply </s>
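A quick way to see this structure (a minimal sketch; tokenize=False makes apply_chat_template return the formatted prompt string instead of token IDs):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")
messages = [
    {"role": "user", "content": "Hello!"},
    {"role": "assistant", "content": "Hi there, how can I help?"},
    {"role": "user", "content": "Tell me a joke."},
]
# Prints the raw prompt string so the [INST] ... [/INST] wrapping is visible
print(tokenizer.apply_chat_template(messages, tokenize=False))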
3. Text Generation
  • model.generate() parameters:
    • max_new_tokens=256: Limits response length.
    • temperature=0.7: Balances creativity vs. determinism.
    • do_sample=True: Enables probabilistic sampling.
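For example, the effect of do_sample can be seen by comparing it with greedy decoding (a small sketch, assuming model, tokenizer, and inputs are defined as in the full script below):

# Greedy decoding: always picks the most likely next token, so the output is deterministic
greedy = model.generate(inputs, max_new_tokens=64, do_sample=False)

# Sampling: draws from the token distribution; temperature=0.7 (< 1.0) makes it more peaked
sampled = model.generate(inputs, max_new_tokens=64, do_sample=True, temperature=0.7)

print(tokenizer.decode(sampled[0], skip_special_tokens=True))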
4. Gradio Interface
  • gr.ChatInterface creates a chat UI with history tracking.
  • share=True generates a public link (uses Gradio’s proxy servers).
Optimizations
  • Quantization: Use 4-bit quantization (bitsandbytes) to reduce GPU memory usage (see the sketch after this list).
  • Caching: Implement Redis/Memcached to store frequent queries.
  • RAG: Add retrieval-augmented generation for domain-specific knowledge.
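As a sketch of the quantization option above (assuming bitsandbytes is installed; BitsAndBytesConfig is the transformers wrapper around it), 4-bit loading looks roughly like this:

from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # store weights in 4-bit NF4 format
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,  # computation still runs in half precision
)
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.1",
    quantization_config=bnb_config,
    device_map="auto",
)

This brings the weight footprint down to roughly 4-5 GB, at some cost in output quality.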
Limitations
  • Latency: Mistral 7B may respond more slowly than smaller models.
  • VRAM Requirements: Requires ≥16GB GPU RAM for float16 inference.

This code runs a locally hosted chatbot using Mistral 7B (via Hugging Face transformers).
Prerequisites
  • Install dependencies:
pip install torch transformers accelerate gradio sentencepiece
  • A GPU (e.g., NVIDIA T4/A100) is recommended for faster inference.
Code (mistral_chatbot.py)
from transformers import AutoModelForCausalLM, AutoTokenizer
import gradio as gr
import torch

# Load Mistral 7B model and tokenizer
model_name = "mistralai/Mistral-7B-Instruct-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, device_map="auto")

def respond(message, history):
    # Rebuild the conversation as role/content messages; Gradio's default history is a list of (user, bot) pairs
    chat_history = []
    for user_msg, bot_msg in history:
        chat_history.append({"role": "user", "content": user_msg})
        chat_history.append({"role": "assistant", "content": bot_msg})
    chat_history.append({"role": "user", "content": message})

    # Tokenize input
    inputs = tokenizer.apply_chat_template(chat_history, return_tensors="pt").to(model.device)

    # Generate response
    outputs = model.generate(inputs, max_new_tokens=256, temperature=0.7, do_sample=True)
    response = tokenizer.decode(outputs[0][len(inputs[0]):], skip_special_tokens=True)

    return response

# Launch Gradio interface
gr.ChatInterface(
    respond,
    title="Mistral 7B Chatbot",
    description="A conversational AI powered by Mistral 7B."
).launch(share=True)  # Set `share=False` for local-only use
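Run it with python mistral_chatbot.py. Gradio prints a local URL in the terminal and, because share=True is set, a temporary public link as well.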