XiaoZhi AI Chatbot - Open Source Embedded Project

An open-source AI chatbot firmware for ESP32 microcontrollers that enables voice interaction using large language models like Qwen and DeepSeek. It features offline wake-up, streaming ASR/TTS, and utilizes the Model Context Protocol (MCP) for IoT device control and cloud-side capability expansion.

Overview

XiaoZhi is a sophisticated open-source AI chatbot project designed for the ESP32 platform. It serves as a voice interaction entry point, leveraging the power of large language models (LLMs) such as Qwen and DeepSeek to provide a conversational experience. Unlike simple voice assistants, XiaoZhi is built around the Model Context Protocol (MCP), allowing it to act as a bridge between AI intelligence and physical hardware control.

The project is highly versatile, supporting a wide range of ESP32 chips including the ESP32-C3, ESP32-S3, and the high-performance ESP32-P4. It is designed to be accessible to both beginners—who can use pre-compiled firmware—and advanced developers looking to customize the interaction flow or hardware integration.

Key Features

XiaoZhi packs a comprehensive suite of features tailored for modern AI hardware:

Voice Interaction Stack: Implements a streaming architecture combining Automatic Speech Recognition (ASR), Large Language Models (LLM), and Text-to-Speech (TTS) for low-latency conversations.
Offline Capabilities: Utilizes Espressif’s ESP-SR for offline voice wake-up and 3D Speaker for speaker recognition and identification.
Multi-Protocol Communication: Supports both WebSocket and a hybrid MQTT+UDP protocol for robust data and audio transmission.
Audio Processing: Uses the OPUS codec for high-quality, efficient audio streaming.
IoT Control via MCP: Implements the Model Context Protocol on both the device side (for controlling LEDs, servos, and GPIOs) and the cloud side (for smart home integration and desktop operations).
Rich Visuals: Supports OLED and LCD displays using the LVGL library to show emojis, battery status, and chat backgrounds.
Connectivity: Compatible with standard Wi-Fi and ML307 Cat.1 4G modules for mobile applications.

Technical Implementation

The project is built on the ESP-IDF (v5.4+) framework, utilizing FreeRTOS for task management. The firmware is optimized for memory efficiency, particularly for constrained devices like the ESP32-C3, by employing techniques such as compressed fonts and selective LVGL widget compilation.

For audio, the system manages I2S interfaces for microphones and speakers, handling the complexities of full-duplex audio streaming. The integration of the MCP protocol is a standout feature, providing a standardized way for the LLM to call “tools” on the device, such as adjusting volume or querying sensor data.

Hardware Support

XiaoZhi is compatible with over 70 open-source hardware designs. Notable supported platforms include:

Espressif ESP32-S3-BOX3
M5Stack CoreS3 and AtomS3R
Waveshare ESP32-S3-Touch-AMOLED
LILYGO T-Circle-S3
Various DIY breadboard configurations using standard components like the INMP441 microphone and MAX98357A amplifier.

Ecosystem and Development

The project is supported by a broad ecosystem, including multiple server-side implementations in Python, Java, and Go. This allows users to host their own backends or connect to the official XiaoZhi cloud service. Additionally, a web-based “Custom Assets Generator” allows developers to easily create and deploy custom wake words, fonts, and UI elements without deep-diving into the codebase.