# Flutter Gemma

The plugin supports Gemma and several other model families. Supported models: Gemma 4 E2B/E4B, Gemma3n E2B/E4B, FastVLM 0.5B, Gemma 3 1B, Gemma 3 270M, FunctionGemma 270M, Qwen3 0.6B, Qwen 2.5, Phi-4 Mini, DeepSeek R1, SmolLM 135M.

Note: Gemma 4 and Gemma3n support multimodal vision and audio input, and FastVLM supports vision. Desktop platforms (macOS, Windows, Linux) require the .litertlm model format.

Gemma is a family of lightweight, state-of-the-art open models built from the same research and technology used to create the Gemini models.

Bring the power of Google's lightweight Gemma language models directly to your Flutter applications. With Flutter Gemma, you can seamlessly incorporate advanced AI capabilities into your Flutter applications, all without relying on external servers.

Features

  • Local Execution: Run Gemma models directly on user devices for enhanced privacy and offline functionality.
  • Platform Support: Compatible with iOS, Android, Web, macOS, Windows, and Linux platforms.
  • 🖥️ Desktop Support: Native desktop apps (macOS, Windows, Linux) with GPU acceleration via LiteRT-LM, called directly from Dart through dart:ffi, with no JVM/JRE bundling. See DESKTOP_SUPPORT.md for details.
  • 🖼️ Multimodal Support: Text + Image input with Gemma3n vision models
  • 🎙️ Audio Input: Record and send audio messages with Gemma3n E2B/E4B models (Android, iOS device, Desktop)
  • 🛠️ Function Calling: Enable your models to call external functions and integrate with other services (supported by select models)
  • 🧠 Thinking Mode: View the reasoning process of DeepSeek and Gemma 4 models with thinking blocks
  • 🛑 Stop Generation: Cancel text generation mid-process on Android, iOS, Web, and Desktop
  • ⚙️ Backend Switching: Choose between CPU and GPU backends for each model individually in the example app
  • 🔍 Advanced Model Filtering: Filter models by features (Multimodal, Function Calls, Thinking) with expandable UI
  • 📊 Model Sorting: Sort models alphabetically, by size, or use default order in the example app
  • LoRA Support: Efficient fine-tuning and integration of LoRA (Low-Rank Adaptation) weights for tailored AI behavior.
  • 📥 Enhanced Downloads: Smart retry logic with exponential backoff for reliable model downloads
  • 🔧 Download Reliability: Automatic restart logic for interrupted downloads (resume not supported by HuggingFace CDN)
  • 📱 Android Foreground Service: Large downloads (>500MB) automatically use foreground service to bypass 9-minute timeout
  • 🔧 Model Replace Policy: Configurable model replacement system (keep/replace) with automatic model switching
  • 📊 Text Embeddings: Generate vector embeddings from text using EmbeddingGemma and Gecko models
  • 🔧 Unified Model Management: Single system for managing both inference and embedding models with automatic validation
  • 💾 Web Persistent Caching: Models persist across browser restarts using Cache API (Web only)

What's new in 0.14.0

  • 🖥️ Desktop rewritten on dart:ffi: no JVM, no gRPC, no separate server. Native libs auto-fetched at build time.
  • 🍎 iOS Metal GPU for .litertlm models on physical devices via FFI.
  • 🐧 Linux GPU (Vulkan/WebGPU) and 🪟 Windows GPU (DirectX 12) ready out of the box.
  • 🤖 Android: Kotlin LiteRtLm dependency removed; FFI used exclusively for .litertlm.

See CHANGELOG.md for the full release history.

Model File Types

Flutter Gemma supports different model file formats, which are grouped into two types based on how chat templates are handled:

Type 1: MediaPipe-Managed Templates

  • .task files: MediaPipe-optimized format for mobile (Android/iOS)
  • .litertlm files: LiteRT-LM format for Android, iOS, and Desktop platforms

Both formats have identical behavior: MediaPipe handles chat templates internally.

Type 2: Manual Template Formatting

  • .bin files: Standard binary format
  • .tflite files: LiteRT format (formerly TensorFlow Lite)

Both formats require manual chat template formatting in your code.

Note: The plugin automatically detects the file extension and applies appropriate formatting. When specifying ModelFileType in your code:

  • Use ModelFileType.task for .task and .litertlm files (same behavior)
  • Use ModelFileType.binary for .bin and .tflite files (same behavior)
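
The grouping above can be sketched as a small extension check. This is only an illustration: `TemplateHandling` and `templateHandlingFor` are hypothetical names local to this example, not part of the plugin's API (the plugin's own type is ModelFileType).

```dart
// Hypothetical local enum for illustration only; the plugin exposes
// ModelFileType.task and ModelFileType.binary instead.
enum TemplateHandling { managed, manual }

/// Returns how chat templates are handled for a model file, based on its
/// extension, or null if the extension is not one of the four listed above.
TemplateHandling? templateHandlingFor(String path) {
  final lower = path.toLowerCase();
  // Type 1: templates are handled internally.
  if (lower.endsWith('.task') || lower.endsWith('.litertlm')) {
    return TemplateHandling.managed;
  }
  // Type 2: the caller formats chat templates manually.
  if (lower.endsWith('.bin') || lower.endsWith('.tflite')) {
    return TemplateHandling.manual;
  }
  return null;
}
```

Under these assumptions, `templateHandlingFor('gemma-3-270m-it.task')` yields the managed group and `templateHandlingFor('model.bin')` yields the manual group.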

Format by Platform

| Format | Android | iOS | Web | Desktop | Use Case |
| --- | --- | --- | --- | --- | --- |
| .task | ✅ | ✅ | ✅ | ❌ | Older models (Gemma3n, Gemma 3, DeepSeek, Qwen 2.5, Phi-4) |
| .litertlm | ✅ | ✅ ¹ | ❌ | ✅ | Newer models (Gemma 4, Qwen3, FastVLM + desktop for all) |
| -web.task | ❌ | ❌ | ✅ | ❌ | Web-specific builds (e.g. Gemma 4, Gemma3n) |
| .bin | ✅ | ✅ | ✅ | ❌ | Manual chat template formatting required |
| .tflite | ✅ | ✅ | ✅ | ✅ | Embeddings only (EmbeddingGemma, Gecko) |

¹ iOS .litertlm runs on the FFI engine; vision and audio are supported on physical devices. The Simulator stays CPU-only because Metal sim has a 256 MB single-allocation cap.
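
As a rough aid, the platform matrix above can be transcribed into a lookup. The sketch below is an assumption-laden illustration: `formatPlatforms` and `isFormatSupported` are hypothetical helper names, and the platform strings ('android', 'ios', 'web', 'desktop') are labels chosen for this example.

```dart
/// Platform support per model format, transcribed from the table above.
const Map<String, Set<String>> formatPlatforms = {
  '.task': {'android', 'ios', 'web'},
  '.litertlm': {'android', 'ios', 'desktop'},
  '-web.task': {'web'},
  '.bin': {'android', 'ios', 'web'},
  '.tflite': {'android', 'ios', 'web', 'desktop'},
};

/// Returns true if the table lists [platform] as supported for the format
/// implied by [fileName]. Unknown extensions return false.
bool isFormatSupported(String fileName, String platform) {
  final lower = fileName.toLowerCase();
  // Check the longer '-web.task' suffix before the generic '.task' suffix.
  final suffix = lower.endsWith('-web.task')
      ? '-web.task'
      : formatPlatforms.keys.firstWhere(
          (s) => s != '-web.task' && lower.endsWith(s),
          orElse: () => '',
        );
  return formatPlatforms[suffix]?.contains(platform) ?? false;
}
```

A check like this could run before a download starts, so an unsupported format is reported early rather than failing at model load time.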

Model Capabilities

The example app offers a curated list of models, each suited for different tasks. Here's a breakdown of the models available and their capabilities:

| Model Family | Best For | Function Calling | Thinking Mode | Vision | Languages | Size |
| --- | --- | --- | --- | --- | --- | --- |
| Gemma 4 E2B | Next-gen multimodal chat: text, image, audio | ✅ | ✅ | ✅ | Multilingual | 2.4GB |
| Gemma 4 E4B | Next-gen multimodal chat: text, image, audio | ✅ | ✅ | ✅ | Multilingual | 4.3GB |
| Gemma3n | On-device multimodal chat and image analysis | ✅ | ❌ | ✅ | Multilingual | 3-6GB |
| FastVLM 0.5B | Fast vision-language inference | ❌ | ❌ | ✅ | Multilingual | 0.5GB |
| Phi-4 Mini | Advanced reasoning and instruction following | ✅ | ❌ | ❌ | Multilingual | 3.9GB |
| DeepSeek R1 | High-performance reasoning and code generation | ✅ | ✅ | ❌ | Multilingual | 1.7GB |
| Qwen3 0.6B | Compact multilingual chat with function calling | ✅ | ✅ | ❌ | Multilingual | 586MB |
| Qwen 2.5 | Strong multilingual chat and instruction following | ✅ | ❌ | ❌ | Multilingual | 0.5-1.6GB |
| Gemma 3 1B | Balanced and efficient text generation | ✅ | ❌ | ❌ | Multilingual | 0.5GB |
| Gemma 3 270M | Ideal for fine-tuning (LoRA) for specific tasks | ❌ | ❌ | ❌ | Multilingual | 0.3GB |
| FunctionGemma 270M | Specialized for function calling on-device | ✅ | ❌ | ❌ | Multilingual | 284MB |
| SmolLM 135M | Ultra-compact, resource-constrained devices | ❌ | ❌ | ❌ | English | 135MB |

ModelType Reference

When installing models, you need to specify the correct ModelType. Use this table to find the right type for your model:

| Model Family | ModelType | Examples |
| --- | --- | --- |
| Gemma (all variants) | ModelType.gemmaIt | Gemma 4 E2B/E4B, Gemma 3 1B, Gemma 3 270M, Gemma3n E2B/E4B |
| DeepSeek | ModelType.deepSeek | DeepSeek R1 |
| Qwen 2.5 | ModelType.qwen | Qwen 2.5 1.5B, Qwen 2.5 0.5B |
| Qwen 3 | ModelType.qwen3 | Qwen3 0.6B |
| FunctionGemma | ModelType.functionGemma | FunctionGemma 270M IT |
| Phi | ModelType.phi | Phi-4 Mini |
| General | ModelType.general | FastVLM 0.5B, SmolLM 135M |

Usage Example:

// Gemma models
await FlutterGemma.installModel(modelType: ModelType.gemmaIt)
  .fromNetwork(url).install();

// DeepSeek models
await FlutterGemma.installModel(modelType: ModelType.deepSeek)
  .fromNetwork(url).install();

// FastVLM and SmolLM (general type)
await FlutterGemma.installModel(modelType: ModelType.general)
  .fromNetwork(url).install();
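
When the model is chosen at runtime (for example, from a user-facing list), the reference table can be encoded as a lookup by name. The following is a minimal sketch under stated assumptions: `modelTypeNameFor` is a hypothetical helper, and it returns the type name as a string so the example stays self-contained without importing the plugin.

```dart
/// Maps a display name such as 'Gemma 3 1B' to the ModelType name listed in
/// the reference table. Returns null for families not in the table.
String? modelTypeNameFor(String modelName) {
  // Normalize: lower-case and drop spaces so 'Qwen 3' and 'Qwen3' match.
  final n = modelName.toLowerCase().replaceAll(' ', '');
  // FunctionGemma must be checked before the generic Gemma prefix.
  if (n.startsWith('functiongemma')) return 'ModelType.functionGemma';
  if (n.startsWith('gemma')) return 'ModelType.gemmaIt';
  if (n.startsWith('deepseek')) return 'ModelType.deepSeek';
  // Qwen3 must be checked before the generic Qwen prefix (Qwen 2.5).
  if (n.startsWith('qwen3')) return 'ModelType.qwen3';
  if (n.startsWith('qwen')) return 'ModelType.qwen';
  if (n.startsWith('phi')) return 'ModelType.phi';
  if (n.startsWith('fastvlm') || n.startsWith('smollm')) {
    return 'ModelType.general';
  }
  return null;
}
```

The order of the checks matters: more specific prefixes come first so that FunctionGemma is not classified as plain Gemma and Qwen3 is not classified as Qwen 2.5.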

Installation

  1. Add flutter_gemma to your pubspec.yaml:

    dependencies:
      flutter_gemma: latest_version
  2. Run flutter pub get to install.

Setup

⚠️ Important: Complete platform-specific setup before using the plugin.

  1. Download a model and, optionally, LoRA weights: obtain a model from the Supported Models section or from HuggingFace.
  2. Platform-specific setup:

iOS

  • Set minimum iOS version in Podfile:
platform :ios, '16.0'  # Required for MediaPipe GenAI
  • Enable file sharing in Info.plist:
<key>UIFileSharingEnabled</key>
<true/>
  • Add network access description in Info.plist (for development):
<key>NSLocalNetworkUsageDescription</key>
<string>This app requires local network access for model inference services.</string>
  • Enable performance optimization in Info.plist (optional):
<key>CADisableMinimumFrameDurationOnPhone</key>
<true/>
  • Add memory entitlements in Runner.entitlements (for large models):
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
	<key>com.apple.developer.kernel.extended-virtual-addressing</key>
	<true/>
	<key>com.apple.developer.kernel.increased-memory-limit</key>
	<true/>
	<key>com.apple.developer.kernel.increased-debugging-memory-limit</key>
	<true/>
</dict>
</plist>
  • Change the linking type of pods to static in Podfile:
use_frameworks! :linkage => :static
  • Setup LiteRT-LM dylib symlinks in ios/Podfile post_install block. LiteRT-LM's gpu_registry calls dlopen("libLiteRtMetalAccelerator.dylib") by basename at runtime. Native Assets bundles the dylibs as .frameworks, so each framework also needs a flat lib*.dylib symlink alongside it (required for GPU on physical iOS devices):
post_install do |installer|
  installer.pods_project.targets.each do |target|
    flutter_additional_ios_build_settings(target)
  end

  # flutter_gemma: create lib*.dylib symlinks next to the bundled
  # .framework so LiteRT-LM's gpu_registry can dlopen by basename.
  installer.aggregate_targets.each do |aggregate_target|
    aggregate_target.user_targets.each do |user_target|
      next if user_target.shell_script_build_phases.any? { |p| p.name == '[flutter_gemma] Setup LiteRT-LM iOS' }
      phase = user_target.new_shell_script_build_phase('[flutter_gemma] Setup LiteRT-LM iOS')
      phase.shell_script = <<~SHELL
        set -e
        FRAMEWORKS="${BUILT_PRODUCTS_DIR}/${PRODUCT_NAME}.app/Frameworks"
        if [ ! -d "${FRAMEWORKS}" ]; then
          echo "[flutter_gemma] no Frameworks/ in ${PRODUCT_NAME}.app - skipping"
          exit 0
        fi
        for base in LiteRtMetalAccelerator GemmaModelConstraintProvider; do
          src="${base}.framework/${base}"
          if [ ! -e "${FRAMEWORKS}/${src}" ]; then
            echo "[flutter_gemma] ${FRAMEWORKS}/${src} missing - Native Assets did not bundle it"
            continue
          fi
          dst="${FRAMEWORKS}/lib${base}.dylib"
          if [ ! -e "${dst}" ]; then
            ln -sf "${src}" "${dst}"
            echo "[flutter_gemma] symlinked lib${base}.dylib -> ${src}"
          fi
        done
      SHELL
    end
  end
end

Android

  • If you want to run the model on the GPU, declare the optional OpenCL native libraries in AndroidManifest.xml. If you plan to use only the CPU, you can skip this step.

Add the following to AndroidManifest.xml, just above the closing </application> tag:

 <uses-native-library
     android:name="libOpenCL.so"
     android:required="false"/>
 <uses-native-library android:name="libOpenCL-car.so" android:required="false"/>
 <uses-native-library android:name="libOpenCL-pixel.so" android:required="false"/>
  • For release builds with ProGuard/R8 enabled, the plugin automatically includes necessary ProGuard rules. If you encounter issues with UnsatisfiedLinkError or missing classes in release builds, ensure your proguard-rules.pro includes:
# MediaPipe
-keep class com.google.mediapipe.** { *; }
-dontwarn com.google.mediapipe.**

# Protocol Buffers
-keep class com.google.protobuf.** { *; }
-dontwarn com.google.protobuf.**

# RAG functionality
-keep class com.google.ai.edge.localagents.** { *; }
-dontwarn com.google.ai.edge.localagents.**

Web

  • Web currently supports only GPU backend models; CPU backend models are not yet supported by MediaPipe.

  • Model compatibility: Mobile .task models often don't work on web. Use web-specific variants: -web.task or .litertlm files. Check model repository for web-compatible versions.

  • Add the dependencies to the index.html file in the web folder:

  <script type="module">
  import { FilesetResolver, LlmInference } from 'https://cdn.jsdelivr.net/npm/@mediapipe/tasks-genai@0.10.27';
  window.FilesetResolver = FilesetResolver;
  window.LlmInference = LlmInference;
  </script>

Desktop (macOS, Windows, Linux)

⚠️ Desktop Model Format

Desktop platforms use LiteRT-LM format only (.litertlm files). MediaPipe .task and .bin models used on mobile/web are NOT compatible with desktop.

Since 0.14.0, desktop inference and embeddings both use the LiteRT-LM C API via dart:ffi directly in the Dart process: no JVM, no gRPC, no separate server. Native libraries are downloaded by hook/build.dart (Native Assets) at build time and bundled into the app automatically.

| Platform | Architecture | GPU Acceleration | Status |
| --- | --- | --- | --- |
| macOS | arm64 (Apple Silicon) | Metal | ✅ Ready |
| macOS | x86_64 (Intel) | - | ❌ Not Supported |
| Windows | x86_64 | DirectX 12 | ✅ Ready |
| Windows | arm64 | - | ❌ Not Supported |
| Linux | x86_64 | Vulkan | ✅ Ready |
| Linux | arm64 | Vulkan | ✅ Ready |

macOS Setup:

The plugin uses Flutter Native Assets to bundle LiteRT-LM dylibs as .frameworks. The LiteRT-LM runtime, however, calls dlopen("libLiteRtMetalAccelerator.dylib") by basename at runtime, so each framework also needs a flat lib*.dylib symlink alongside it. Add this to your macos/Podfile post_install block:

post_install do |installer|
  installer.pods_project.targets.each do |target|
    flutter_additional_macos_build_settings(target)
  end

  # flutter_gemma: create lib*.dylib symlinks next to the bundled
  # .framework so LiteRT-LM's gpu_registry can dlopen by basename.
  installer.aggregate_targets.each do |aggregate_target|
    aggregate_target.user_targets.each do |user_target|
      next if user_target.shell_script_build_phases.any? { |p| p.name == '[flutter_gemma] Setup LiteRT-LM macOS' }
      phase = user_target.new_shell_script_build_phase('[flutter_gemma] Setup LiteRT-LM macOS')
      phase.shell_script = <<~SHELL
        set -e
        FRAMEWORKS="${BUILT_PRODUCTS_DIR}/${PRODUCT_NAME}.app/Contents/Frameworks"
        if [ ! -d "${FRAMEWORKS}" ]; then
          echo "[flutter_gemma] no Contents/Frameworks/ in ${PRODUCT_NAME}.app - skipping"
          exit 0
        fi
        for base in LiteRtMetalAccelerator GemmaModelConstraintProvider; do
          src="${base}.framework/Versions/Current/${base}"
          if [ ! -e "${FRAMEWORKS}/${src}" ]; then
            echo "[flutter_gemma] ${FRAMEWORKS}/${src} missing - Native Assets did not bundle it"
            continue
          fi
          dst="${FRAMEWORKS}/lib${base}.dylib"
          if [ ! -e "${dst}" ]; then
            ln -sf "${src}" "${dst}"
            echo "[flutter_gemma] symlinked lib${base}.dylib -> ${src}"
          fi
        done
      SHELL
    end
  end
end

Add to macos/Runner/DebugProfile.entitlements and Release.entitlements:

<key>com.apple.security.cs.disable-library-validation</key>
<true/>

Windows Setup:

No additional configuration required. hook/build.dart (Native Assets) downloads LiteRtLm.dll, its companion DLLs, and the DXC runtime (dxil.dll, dxcompiler.dll v1.9.2602) from the GitHub release on first build, verifies them via SHA256, and bundles them next to your app.exe. End users need the Microsoft Visual C++ Redistributable 2019 or later; most modern Windows 10/11 systems already have it.

Linux Setup:

No additional configuration required. Build dependencies:

sudo apt install clang cmake ninja-build libgtk-3-dev

For GPU acceleration, ensure Vulkan drivers are installed:

sudo apt install vulkan-tools libvulkan1

📚 Full desktop documentation: DESKTOP_SUPPORT.md

Quick Start

⚠️ Important: Complete platform setup before running this code.

1. Install a Model (One Time)

import 'package:flutter_gemma/flutter_gemma.dart';

// Install model
await FlutterGemma.installModel(
  modelType: ModelType.gemmaIt,
).fromNetwork(
  'https://huggingface.co/google/gemma-3-2b-it/resolve/main/gemma-3-2b-it-gpu-int8.task',
  token: 'your_hf_token',
).withProgress((progress) {
  print('Downloading: $progress%'); // progress is an int from 0 to 100
}).install();

2. Create and Use Model (Multiple Times)

// Create model with specific configuration
final model = await FlutterGemma.getActiveModel(
  maxTokens: 2048,
  preferredBackend: PreferredBackend.gpu,
);

// Use model
final chat = await model.createChat();
await chat.addQueryChunk(Message.text(
  text: 'Explain quantum computing',
  isUser: true,
));
final response = await chat.generateChatResponse();

// Cleanup
await model.close();

System Instructions

Control model behavior with a system-level instruction:

final chat = await model.createChat(
  systemInstruction: 'You are a concise assistant. Always respond in bullet points.',
);

Platform support:

  • Android .litertlm / Desktop: Passed natively via ConversationConfig.systemInstruction
  • Android .task / iOS / Web: Prepended to first user message as fallback

3. Multiple Instances from Same Model

// Install once
await FlutterGemma.installModel(modelType: ModelType.gemmaIt)
  .fromNetwork(url).install();

// Create multiple instances
final quickModel = await FlutterGemma.getActiveModel(maxTokens: 512);
final deepModel = await FlutterGemma.getActiveModel(maxTokens: 4096);
// Both use the SAME model file!

Installation Sources

// Network
await FlutterGemma.installModel(modelType: ModelType.gemmaIt)
  .fromNetwork('https://example.com/model.task', token: 'optional')
  .install();

// Flutter assets
await FlutterGemma.installModel(modelType: ModelType.gemmaIt)
  .fromAsset('assets/models/model.task')
  .install();

// Native bundle
await FlutterGemma.installModel(modelType: ModelType.gemmaIt)
  .fromBundled('model.task')
  .install();

// External file
await FlutterGemma.installModel(modelType: ModelType.gemmaIt)
  .fromFile('/path/to/model.task')
  .install();

Modern API vs Legacy API

Modern API (Recommended) ✅

Benefits:

  • ✅ Cleaner, more intuitive
  • ✅ Type-safe ModelSource
  • ✅ Automatic active model management
  • ✅ Install once, create many instances

Usage:

await FlutterGemma.installModel(modelType: ModelType.gemmaIt)
  .fromNetwork(url).install();
final model = await FlutterGemma.getActiveModel(maxTokens: 2048);

Legacy API ⚠️ Deprecated

⚠️ DEPRECATED: This API is maintained for backwards compatibility only. New projects should use the Modern API above.

Still works but requires manual ModelType specification:

final model = await FlutterGemmaPlugin.instance.createModel(
  modelType: ModelType.gemmaIt,  // Must specify every time
  maxTokens: 2048,
);

Initialize Flutter Gemma

Add to your main.dart:

import 'package:flutter/material.dart'; // WidgetsFlutterBinding, runApp
import 'package:flutter_gemma/core/api/flutter_gemma.dart';

void main() {
  WidgetsFlutterBinding.ensureInitialized();

  // Optional: Initialize with HuggingFace token for gated models
  FlutterGemma.initialize(
    huggingFaceToken: const String.fromEnvironment('HUGGINGFACE_TOKEN'),
    maxDownloadRetries: 10,
  );

  runApp(MyApp());
}

Configuration Options:

  • huggingFaceToken: Authentication token for gated models (Gemma3n, EmbeddingGemma)
  • maxDownloadRetries: Number of retry attempts for failed downloads (default: 10)
  • webStorageMode: (Web only) Storage strategy for model files (default: cacheApi)
    • WebStorageMode.cacheApi: Cache API with Blob URLs (for models <2GB)
    • WebStorageMode.streaming: OPFS streaming (for large models >2GB like E4B, 7B)
    • WebStorageMode.none: No caching (ephemeral mode for testing)

Example:

FlutterGemma.initialize(
  huggingFaceToken: const String.fromEnvironment('HUGGINGFACE_TOKEN'),
  maxDownloadRetries: 10,
  webStorageMode: WebStorageMode.streaming,  // For large models (>2GB)
);
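
The size guidance for webStorageMode can be expressed as a small decision helper. This is a sketch under assumptions: `StorageChoice` is a hypothetical local mirror of the three documented modes (not the plugin's WebStorageMode), "2GB" is interpreted as 2 GiB, and a model of exactly that size is routed to streaming because the text above leaves the boundary unspecified.

```dart
// Hypothetical local mirror of the three documented modes.
enum StorageChoice { cacheApi, streaming, none }

// Assumption: the 2GB threshold is interpreted as 2 GiB.
const int twoGiB = 2 * 1024 * 1024 * 1024;

/// Picks a storage strategy from the model size in bytes.
/// Set [ephemeral] for test runs that should not cache anything.
StorageChoice chooseStorage(int modelBytes, {bool ephemeral = false}) {
  if (ephemeral) return StorageChoice.none;
  return modelBytes < twoGiB
      ? StorageChoice.cacheApi
      : StorageChoice.streaming;
}
```

With these assumptions, a 586MB model maps to the Cache API mode and a 4.3GB model maps to streaming; the result would then be translated to the corresponding WebStorageMode value when calling FlutterGemma.initialize.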

HuggingFace Authentication 🔐

Many models require authentication to download from HuggingFace. Never commit tokens to version control.

✅ Recommended: config.json Pattern

This approach keeps tokens out of version control in both development and production builds.

Step 1: Create config template file config.json.example:

{
  "HUGGINGFACE_TOKEN": ""
}

Step 2: Copy and add your token:

cp config.json.example config.json
# Edit config.json and add your token from https://huggingface.co/settings/tokens

Step 3: Add to .gitignore:

# Never commit tokens!
config.json

Step 4: Run with config:

flutter run --dart-define-from-file=config.json

Step 5: Access in code:

void main() {
  WidgetsFlutterBinding.ensureInitialized();

  // Read from environment (populated by --dart-define-from-file)
  const token = String.fromEnvironment('HUGGINGFACE_TOKEN');

  // Initialize with token (optional if all models are public)
  FlutterGemma.initialize(
    huggingFaceToken: token.isNotEmpty ? token : null,
  );

  runApp(MyApp());
}

Alternative: Environment Variables

export HUGGINGFACE_TOKEN=hf_your_token_here
flutter run --dart-define=HUGGINGFACE_TOKEN=$HUGGINGFACE_TOKEN

Alternative: Per-Download Token

// Pass token directly for specific downloads
await FlutterGemma.installModel(
  modelType: ModelType.gemmaIt,
)
  .fromNetwork(
    'https://huggingface.co/google/gemma-3n-E2B-it-litert-preview/resolve/main/gemma-3n-E2B-it-int4.task',
    token: 'hf_your_token_here',  // ⚠️ Not recommended - use config.json
  )
  .install();
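
The HuggingFace download URLs used in this README share one shape: repository path, then resolve/main, then the file name. A minimal sketch of building such a URL with Dart's core Uri class follows; `huggingFaceFileUrl` is a hypothetical helper name, and the default revision of 'main' is an assumption taken from the example URLs above.

```dart
/// Builds a download URL of the form
/// https://huggingface.co/<repo>/resolve/<revision>/<fileName>.
Uri huggingFaceFileUrl(
  String repo,
  String fileName, {
  String revision = 'main', // assumption: examples above all use 'main'
}) {
  return Uri.https('huggingface.co', '/$repo/resolve/$revision/$fileName');
}
```

The resulting `Uri` can be converted with `toString()` and passed to fromNetwork, which keeps repository and file names in one place instead of scattering full URLs through the code.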

Which Models Require Authentication?

Common gated models:

  • ✅ Gemma3n (E2B, E4B) - google/ repos are gated
  • ✅ Gemma 3 1B - litert-community/ requires access
  • ✅ Gemma 3 270M - litert-community/ requires access
  • ✅ EmbeddingGemma - litert-community/ requires access

Public models (no auth needed):

  • ❌ DeepSeek, Qwen3, Qwen 2.5, SmolLM, Phi-4, FastVLM - Public repos

Get your token: https://huggingface.co/settings/tokens

Grant access to gated repos: Visit model page → "Request Access" button

Model Sources 📦

Flutter Gemma supports multiple model sources with different capabilities:

| Source Type | Platform | Progress | Resume | Authentication | Use Case |
| --- | --- | --- | --- | --- | --- |
| NetworkSource | All | ✅ Detailed | ⚠️ Server-dependent | ✅ Supported | HuggingFace, CDNs, private servers |
| AssetSource | All | ⚠️ End only | ❌ No | ❌ N/A | Models bundled in app assets |
| BundledSource | All | ⚠️ End only | ❌ No | ❌ N/A | Native platform resources |
| FileSource | Mobile only | ⚠️ End only | ❌ No | ❌ N/A | User-selected files (file picker) |

NetworkSource - Internet Downloads

Downloads models from HTTP/HTTPS URLs with full progress tracking and authentication.

Features:

  • ✅ Progress tracking (0-100%)
  • ⚠️ Resume after interruption (server-dependent, not supported by HuggingFace CDN)
  • ✅ HuggingFace authentication
  • ✅ Smart retry logic with exponential backoff
  • ✅ Background downloads on mobile
  • ✅ Cancellable downloads with CancelToken
  • ✅ Android foreground service for large downloads (>500MB)
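
For readers unfamiliar with exponential backoff, the idea is that the wait between retries doubles after each failed attempt, up to a cap. The sketch below illustrates the general technique only; the plugin's actual retry schedule is not documented here, and the 1-second base, 60-second cap, and the name `backoffDelay` are assumptions made for this example.

```dart
/// Delay before retry number [attempt] (zero-based):
/// base, 2x base, 4x base, ... limited to [cap].
Duration backoffDelay(
  int attempt, {
  Duration base = const Duration(seconds: 1), // assumed base delay
  Duration cap = const Duration(seconds: 60), // assumed upper bound
}) {
  // Bound the exponent so the multiplier cannot overflow.
  final shift = attempt < 0 ? 0 : (attempt > 20 ? 20 : attempt);
  final delay = base * (1 << shift);
  return delay > cap ? cap : delay;
}
```

With the defaults assumed here, the first retries wait 1, 2, 4, and 8 seconds, and later retries settle at the 60-second cap. Production implementations often add random jitter so that many clients do not retry in lockstep.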

Example:

// Public model
await FlutterGemma.installModel(
  modelType: ModelType.gemmaIt,
)
  .fromNetwork('https://example.com/model.bin')
  .withProgress((progress) => print('$progress%'))
  .install();

// Private model with authentication
await FlutterGemma.installModel(
  modelType: ModelType.gemmaIt,
)
  .fromNetwork(
    'https://huggingface.co/google/gemma-3n-E2B-it-litert-preview/resolve/main/model.task',
    token: 'hf_...',  // Or use FlutterGemma.initialize(huggingFaceToken: ...)
  )
  .withProgress((progress) => setState(() => _progress = progress))
  .install();

Android Foreground Service (Large Downloads):

Android has a 9-minute background execution limit. For large models (>500MB), you can use foreground service mode, which shows a notification but bypasses this timeout:

// Auto-detect based on file size (>500MB = foreground) - DEFAULT
await FlutterGemma.installModel(modelType: ModelType.gemmaIt)
  .fromNetwork(url)  // foreground: null (auto-detect)
  .install();

// Force foreground mode (always show notification)
await FlutterGemma.installModel(modelType: ModelType.gemmaIt)
  .fromNetwork(url, foreground: true)
  .install();

// Force background mode (may fail for large files)
await FlutterGemma.installModel(modelType: ModelType.gemmaIt)
  .fromNetwork(url, foreground: false)
  .install();

Foreground Parameter:

  • null (default): Auto-detect based on file size. Files >500MB use foreground service.
  • true: Always use foreground service (shows notification, no timeout)
  • false: Never use foreground service (subject to 9-minute timeout)

Note: iOS uses native URLSession which handles long downloads automatically - no foreground service needed.
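
The three-state foreground parameter can be summarized as a single decision. This sketch only restates the rules listed above; `shouldUseForeground` is a hypothetical helper, and treating 500MB as 500 * 1024 * 1024 bytes is an assumption, since the exact byte threshold is not stated.

```dart
// Assumption: the documented ">500MB" threshold expressed in bytes.
const int foregroundThresholdBytes = 500 * 1024 * 1024;

/// Resolves the effective mode for a download.
/// [foreground] == null means auto-detect from [fileSizeBytes].
bool shouldUseForeground(bool? foreground, int fileSizeBytes) {
  // An explicit true/false always wins; null falls back to the size rule.
  return foreground ?? (fileSizeBytes > foregroundThresholdBytes);
}
```

Under these assumptions a 2.4GB model with the default (null) setting resolves to foreground mode, while passing false forces background mode regardless of size.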

Cancelling Downloads:

Use CancelToken to cancel downloads in progress:

import 'package:flutter_gemma/core/model_management/cancel_token.dart';

// Create cancel token
final cancelToken = CancelToken();

// Start download with cancel token
final future = FlutterGemma.installModel(
  modelType: ModelType.gemmaIt,
)
  .fromNetwork(url)
  .withCancelToken(cancelToken)  // ← Pass cancel token via builder
  .withProgress((progress) => print('Progress: $progress%'))
  .install();

// Cancel download from another part of your code
// (e.g., user pressed cancel button)
cancelToken.cancel('User cancelled download');

// Handle cancellation
try {
  await future;
  print('Download completed');
} catch (e) {
  if (CancelToken.isCancel(e)) {
    print('Download was cancelled by user');
  } else {
    print('Download failed: $e');
  }
}

// Check if cancelled
if (cancelToken.isCancelled) {
  print('Reason: ${cancelToken.cancelReason}');
}

CancelToken Features:

  • ✅ Non-breaking: Optional parameter, existing code works without changes
  • ✅ Works with network downloads (inference + embedding models)
  • ✅ Cancels ALL files in multi-file downloads (embedding: model + tokenizer)
  • ✅ Platform-independent (Mobile + Web)
  • ✅ Throws DownloadCancelledException for proper error handling
  • ✅ Thread-safe cancellation

AssetSource - Flutter Assets

Copies models from Flutter assets (declared in pubspec.yaml).

Features:

  • ✅ No network required
  • ✅ Fast installation (local copy)
  • ⚠️ Increases app size significantly
  • ✅ Works offline

Example:

// 1. Add to pubspec.yaml
// assets:
//   - models/gemma-2b-it.bin

// 2. Install from asset
await FlutterGemma.installModel(
  modelType: ModelType.gemmaIt,
)
  .fromAsset('models/gemma-2b-it.bin')
  .install();

BundledSource - Native Resources

Production-Ready Offline Models: Include small models directly in your app bundle for instant availability without downloads.

Use Cases:

  • ✅ Offline-first applications (works without internet from first launch)
  • ✅ Small models (Gemma 3 270M ~300MB)
  • ✅ Core features requiring guaranteed availability
  • ⚠️ Not for large models (increases app size significantly)

Platform Setup:

Android (android/app/src/main/assets/models/)

# Place your model file
android/app/src/main/assets/models/gemma-3-270m-it.task

iOS (Add to Xcode project)

  1. Drag model file into Xcode project
  2. Check "Copy items if needed"
  3. Add to target membership

Web (Static files in web/ directory)

# Place model files in web/ directory
web/gemma-3-270m-it.task

# Files are automatically copied to build/web/ during production build
flutter build web

⚠️ Web Platform Limitation:

  • Production only: Bundled resources work ONLY in production builds (flutter build web)
  • Debug mode: Files in web/ are NOT served by flutter run dev server
  • For development: Use NetworkSource or AssetSource instead

Features:

  • ✅ Zero network dependency
  • ✅ No installation delay
  • ✅ No storage permission needed
  • ✅ Direct path usage (no file copying)

Example:

await FlutterGemma.installModel(
  modelType: ModelType.gemmaIt,
)
  .fromBundled('gemma-3-270m-it.task')
  .install();

App Size Impact:

  • SmolLM 135M: ~135MB
  • Gemma 3 270M: ~300MB
  • Qwen3 0.6B: ~586MB
  • Consider hosting large models for download instead

FileSource - External Files (Mobile Only)

References external files (e.g., user-selected via file picker).

Features:

  • ✅ No copying (references original file)
  • ✅ Protected from cleanup
  • ❌ Web not supported (no local file system)

Example:

// Mobile only - after user selects file with file_picker
final path = '/data/user/0/com.app/files/model.task';
await FlutterGemma.installModel(
  modelType: ModelType.gemmaIt,
)
  .fromFile(path)
  .install();

Important: On web, FileSource only works with URLs or asset paths, not local file system paths.

Migration from Legacy to Modern API 🔄

If you're upgrading from the Legacy API, here are common migration patterns:

Installing Models

Legacy API:

// Network download
final spec = MobileModelManager.createInferenceSpec(
  name: 'model.bin',
  modelUrl: 'https://example.com/model.bin',
);

await FlutterGemmaPlugin.instance.modelManager
  .downloadModelWithProgress(spec, token: token)
  .listen((progress) {
    print('${progress.overallProgress}%');
  });

Modern API:

// Network download
await FlutterGemma.installModel(
  modelType: ModelType.gemmaIt,
)
  .fromNetwork(
    'https://example.com/model.bin',
    token: token,
  )
  .withProgress((progress) {
    print('$progress%');
  })
  .install();

Legacy API:

// From assets
await modelManager.installModelFromAssetWithProgress(
  'model.bin',
  loraPath: 'lora.bin',
).listen((progress) {
  print('$progress%');
});

Modern API:

// From assets
await FlutterGemma.installModel(
  modelType: ModelType.gemmaIt,
)
  .fromAsset('model.bin')
  .withProgress((progress) {
    print('$progress%');
  })
  .install();

// LoRA weights can be installed with the model
await FlutterGemma.installModel(
  modelType: ModelType.gemmaIt,
)
  .fromAsset('model.bin')
  .withLoraFromAsset('lora.bin')
  .install();

Checking Model Installation

Legacy API:

final spec = MobileModelManager.createInferenceSpec(
  name: 'model.bin',
  modelUrl: url,
);

final isInstalled = await FlutterGemmaPlugin
  .instance.modelManager
  .isModelInstalled(spec);

Modern API:

final isInstalled = await FlutterGemma
  .isModelInstalled('model.bin');

Key Migration Notes

  • ✅ Simpler imports: Use package:flutter_gemma/core/api/flutter_gemma.dart
  • ✅ Builder pattern: Chain methods for cleaner code
  • ✅ Callback-based progress: Simpler than streams for most cases
  • ✅ Type-safe sources: Compile-time validation of source types
  • ⚠️ Breaking change: Progress values are now int (0-100) instead of DownloadProgress object
  • ⚠️ Separate files: Model and LoRA weights installed independently

Model Creation and Inference

Modern API (Recommended):

// Create model with runtime configuration
final inferenceModel = await FlutterGemma.getActiveModel(
  maxTokens: 2048,
  preferredBackend: PreferredBackend.gpu,
);

final chat = await inferenceModel.createChat();
await chat.addQueryChunk(Message.text(text: 'Hello!', isUser: true));
final response = await chat.generateChatResponse();

Legacy API (Still supported):

// Works with both Legacy and Modern installation methods
final inferenceModel = await FlutterGemmaPlugin.instance.createModel(
  modelType: ModelType.gemmaIt,
  preferredBackend: PreferredBackend.gpu,
  maxTokens: 2048,
);

final chat = await inferenceModel.createChat();
await chat.addQueryChunk(Message.text(text: 'Hello!', isUser: true));
final response = await chat.generateChatResponse();

Usage (Legacy API) ⚠️ DEPRECATED

The pre-Modern stream-based API (FlutterGemmaPlugin.instance.modelManager, installModelFromAsset, downloadModelFromNetworkWithProgress, etc.) is still supported but deprecated. New projects should use the Modern API above.

📚 Full Legacy API reference: docs/LEGACY_API.md

🖼️ Message Types

The plugin supports several message types:

// Text only
final textMessage = Message.text(text: "Hello!", isUser: true);

// Text + Image
final multimodalMessage = Message.withImage(
  text: "What's in this image?",
  imageBytes: imageBytes,
  isUser: true,
);

// Image only
final imageMessage = Message.imageOnly(imageBytes: imageBytes, isUser: true);

// Tool response (for function calling)
final toolMessage = Message.toolResponse(
  toolName: 'change_background_color',
  response: {'status': 'success', 'color': 'blue'},
);

// System information message
final systemMessage = Message.systemInfo(text: "Function completed successfully");

// Thinking content (for DeepSeek models)
final thinkingMessage = Message.thinking(text: "Let me analyze this problem...");

// Check if message contains image
if (message.hasImage) {
  print('This message contains an image');
}

// Create a copy of message
final copiedMessage = message.copyWith(text: "Updated text");

💬 Response Types

The model can return different types of responses depending on capabilities:

// Handle different response types
chat.generateChatResponseAsync().listen((response) {
  if (response is TextResponse) {
    // Regular text token from the model
    print('Text token: ${response.token}');
    // Use response.token to update your UI incrementally
    
  } else if (response is FunctionCallResponse) {
    // Model wants to call a function (Gemma3n, DeepSeek, Qwen2.5)
    print('Function: ${response.name}');
    print('Arguments: ${response.args}');
    
    // Execute the function and send response back
    _handleFunctionCall(response);
  } else if (response is ThinkingResponse) {
    // Model's reasoning process (DeepSeek models only)
    print('Thinking: ${response.content}');
    
    // Show thinking process in UI
    _showThinkingBubble(response.content);
  }
});

Response Types:

  • TextResponse: Contains a text token (response.token) for regular model output
  • FunctionCallResponse: Contains function name (response.name) and arguments (response.args) when the model wants to call a function
  • ThinkingResponse: Contains the model's reasoning process (response.content) for DeepSeek models with thinking mode enabled
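
When a complete reply is needed rather than incremental UI updates, the stream can be drained into buffers. The following is a minimal, plugin-independent sketch: the three response classes are hypothetical stand-ins that mirror only the fields named above (token, name, args, content), and Collected and collect are illustrative names.

```dart
// Hypothetical stand-ins mirroring only the fields described above.
// In a real app these types would come from the plugin instead.
abstract class ModelResponse {}

class TextResponse extends ModelResponse {
  TextResponse(this.token);
  final String token;
}

class FunctionCallResponse extends ModelResponse {
  FunctionCallResponse(this.name, this.args);
  final String name;
  final Map<String, dynamic> args;
}

class ThinkingResponse extends ModelResponse {
  ThinkingResponse(this.content);
  final String content;
}

/// Everything gathered from one response stream.
class Collected {
  Collected(this.text, this.thinking, this.calls);
  final String text;
  final String thinking;
  final List<FunctionCallResponse> calls;
}

/// Drains the stream, concatenating text tokens and thinking content
/// separately and keeping function calls in arrival order.
Future<Collected> collect(Stream<ModelResponse> responses) async {
  final text = StringBuffer();
  final thinking = StringBuffer();
  final calls = <FunctionCallResponse>[];
  await for (final response in responses) {
    if (response is TextResponse) {
      text.write(response.token);
    } else if (response is ThinkingResponse) {
      thinking.write(response.content);
    } else if (response is FunctionCallResponse) {
      calls.add(response);
    }
  }
  return Collected(text.toString(), thinking.toString(), calls);
}
```

In an app, the stand-in classes would be dropped and the stream from chat.generateChatResponseAsync() passed in, assuming its element type exposes the same fields.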

🎯 Supported Models

Platform Support

| Model | Size | Desktop | Mobile | Web |
|---|---|---|---|---|
| Gemma 4 E2B | 2.4GB | ✅ | ✅ | ✅ |
| Gemma 4 E4B | 4.3GB | ✅ | ✅ | ✅ |
| Gemma3n E2B | 3.1GB | ✅ | ✅ | ✅ |
| Gemma3n E4B | 6.5GB | ✅ | ✅ | ✅ |
| FastVLM 0.5B | 0.5GB | ✅ | ❌ | ❌ |
| Gemma-3 1B | 0.5GB | ✅ | ✅ | ✅ |
| Gemma 3 270M | 0.3GB | ✅ | ✅ | ✅ |
| FunctionGemma 270M | 284MB | ✅ | ✅ | ❌ |
| Qwen3 0.6B | 586MB | ✅ | ✅ | ✅ |
| Qwen 2.5 1.5B | 1.6GB | ✅ | ✅ | ❌ |
| Qwen 2.5 0.5B | 0.5GB | ❌ | ✅ | ❌ |
| SmolLM 135M | 135MB | ❌ | ✅ | ❌ |
| Phi-4 Mini | 3.9GB | ✅ | ✅ | ✅ |
| DeepSeek R1 | 1.7GB | ❌ | ✅ | ❌ |

📊 Text Embedding Models

All embedding models generate 768-dimensional vectors. The numbers in names (64/256/512/1024/2048) indicate maximum input sequence length in tokens, not embedding dimension.

| Model | Parameters | Dimensions | Max Seq Length | Size | Best For | Auth Required |
|---|---|---|---|---|---|---|
| Gecko 64 | 110M | 768D | 64 tokens | 110MB | Short queries, real-time search | ❌ |
| Gecko 256 | 110M | 768D | 256 tokens | 114MB | Balanced speed/accuracy | ❌ |
| Gecko 512 | 110M | 768D | 512 tokens | 116MB | Medium context documents | ❌ |
| EmbeddingGemma 256 | 300M | 768D | 256 tokens | 179MB | High accuracy, short context | ✅ |
| EmbeddingGemma 512 | 300M | 768D | 512 tokens | 179MB | High accuracy, medium context | ✅ |
| EmbeddingGemma 1024 | 300M | 768D | 1024 tokens | 183MB | Long documents, detailed content | ✅ |
| EmbeddingGemma 2048 | 300M | 768D | 2048 tokens | 196MB | Very long documents | ✅ |

Performance Comparison (Android Pixel 8 with GPU acceleration):

  • Gecko 64: ~109ms/doc embedding, 130ms search (⚡ fastest - 2.6x faster than EmbeddingGemma)
  • EmbeddingGemma 256: ~286ms/doc embedding, 342ms search (🎯 more accurate - 300M vs 110M params)

Use Cases:

  • ✅ Gecko 64: Real-time search, mobile apps, short queries (≤64 tokens), fast inference
  • ✅ Gecko 256/512: Balanced use cases, general-purpose embeddings, good speed/quality tradeoff
  • ✅ EmbeddingGemma 256/512: High-quality embeddings, semantic search, better accuracy
  • ✅ EmbeddingGemma 1024/2048: Long documents, detailed content, research papers, articles
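
To make concrete what semantic search over these vectors involves, here is a minimal pure-Dart sketch of cosine similarity and a top-k ranking. It is an illustration only, not the plugin's VectorStore implementation; cosineSimilarity and topK are hypothetical names, and the vectors are assumed to be plain List<double> values of equal length (768 for the models above).

```dart
import 'dart:math' as math;

/// Cosine similarity between two equal-length vectors.
/// Returns 0.0 if either vector has zero magnitude.
double cosineSimilarity(List<double> a, List<double> b) {
  if (a.length != b.length) {
    throw ArgumentError('Vectors must have the same length');
  }
  var dot = 0.0, normA = 0.0, normB = 0.0;
  for (var i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA == 0 || normB == 0) return 0.0;
  return dot / (math.sqrt(normA) * math.sqrt(normB));
}

/// Indices of the `k` documents most similar to `query`, best first.
List<int> topK(List<double> query, List<List<double>> docs, int k) {
  final scored = [
    for (var i = 0; i < docs.length; i++)
      MapEntry(i, cosineSimilarity(query, docs[i])),
  ];
  // Sort by descending similarity.
  scored.sort((x, y) => y.value.compareTo(x.value));
  return scored.take(k).map((e) => e.key).toList();
}
```

This brute-force scan is linear in the number of documents, which is reasonable for small on-device collections; larger collections would normally rely on the plugin's own storage and search.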

🛠️ Model Function Calling Support

Function calling is currently supported by the following models:

✅ Models with Function Calling Support

  • Gemma 4 (E2B, E4B) - Full function calling support
  • Gemma3n (E2B, E4B) - Full function calling support
  • Gemma 3 1B - Function calling support
  • FunctionGemma 270M - Google's specialized function calling model
  • DeepSeek R1 - Function calling + thinking mode support
  • Qwen models (0.5B, 0.6B, 1.5B) - Full function calling support
  • Phi-4 Mini - Advanced reasoning with function calling support

❌ Models WITHOUT Function Calling Support

  • Gemma 3 270M - Text generation only
  • SmolLM 135M - Text generation only
  • FastVLM 0.5B - Vision model, no function calling

Important Notes:

  • When using unsupported models with tools, the plugin will log a warning and ignore the tools
  • Models will work normally for text generation even if function calling is not supported
  • Check the supportsFunctionCalls property in your model configuration

Platform Support Details 🌐

Feature Comparison

| Feature | Android | iOS | Web | Desktop | Notes |
|---|---|---|---|---|---|
| Text Generation | ✅ Full | ✅ Full | ✅ Full | ✅ Full | All models supported |
| Image Input (Multimodal) | ✅ Full | ✅ Full | ✅ Full | ⚠️ Broken (#684) | macOS: model hallucinates |
| Audio Input | ✅ Full | ✅ Full | ❌ Not supported | ✅ Full | Gemma3n E2B/E4B |
| Function Calling | ✅ Full | ✅ Full | ✅ Full | ❌ Not supported | LiteRT-LM limitation |
| Thinking Mode | ✅ Full | ✅ Full | ✅ Full | ✅ Full | DeepSeek & Gemma 4 |
| Stop Generation | ✅ Full | ✅ Full | ✅ Full | ✅ Full | Cancel mid-process |
| GPU Acceleration | ✅ Full | ✅ Full | ✅ Full | ⚠️ Partial | macOS GPU broken |
| NPU Acceleration | ✅ Full | ❌ Not supported | ❌ Not supported | ❌ Not supported | Android only (.litertlm) |
| CPU Backend | ✅ Full | ✅ Full | ❌ Not supported | ✅ Full | MediaPipe limitation |
| Streaming Responses | ✅ Full | ✅ Full | ✅ Full | ✅ Full | Real-time generation |
| LoRA Support | ✅ Full | ✅ Full | ✅ Full | ❌ Not supported | LiteRT-LM limitation |
| Text Embeddings | ✅ Full | ✅ Full | ✅ Full | ✅ Full | EmbeddingGemma, Gecko |
| VectorStore (RAG) | ✅ SQLite | ✅ SQLite | ✅ SQLite WASM | ✅ SQLite | Semantic search, RAG |
| File Downloads | ✅ Background | ✅ Background | ✅ In-memory | ✅ Background | Platform-specific |
| Asset Loading | ✅ Full | ✅ Full | ✅ Full | ❌ Not supported | Flutter assets N/A |
| Bundled Resources | ✅ Full | ✅ Full | ✅ Full | ❌ Not supported | Native bundles only |
| External Files (FileSource) | ✅ Full | ✅ Full | ❌ Not supported | ✅ Full | No local FS on web |

Web Platform Specifics

Authentication

  • Required for gated models: Gemma3n, Gemma 3 1B/270M, EmbeddingGemma
  • Configuration: Use FlutterGemma.initialize(huggingFaceToken: '...') or pass token per-download
  • Storage: Tokens stored in browser memory (not localStorage)

File Handling

  • Downloads: Creates blob URLs in browser memory (no actual files)
  • Storage: IndexedDB via WebFileSystemService
  • FileSource: Only works with HTTP/HTTPS URLs or assets/ paths
  • Local file paths: ❌ Not supported (browser security restriction)

Web Storage Modes (v0.12.1+)

Three Storage Modes:

1. Cache API Mode (default, WebStorageMode.cacheApi):

  • Uses browser Cache API with Blob URLs
  • Models persist across browser restarts
  • Best for models <2GB

2. Streaming Mode (WebStorageMode.streaming):

  • Uses OPFS with ReadableStream
  • Bypasses browser 2GB ArrayBuffer limit
  • Required for large models (E4B 4GB+, 7B, 27B)
  • Requires Chrome 86+, Edge 86+, Safari 15.2+

3. Ephemeral Mode (WebStorageMode.none):

  • Models stored in memory only
  • Cleared when browser closes
  • For testing/demos

// Default: Cache API for small models
FlutterGemma.initialize(webStorageMode: WebStorageMode.cacheApi);

// Streaming for large models (>2GB)
FlutterGemma.initialize(webStorageMode: WebStorageMode.streaming);

// Check if streaming is supported
final supported = await FlutterGemma.isStreamingSupported();
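
The choice between the first two modes can be expressed as a small decision function. The sketch below is a hedged illustration: StorageChoice and chooseStorage are hypothetical names (a local stand-in is used instead of the plugin's WebStorageMode so the snippet stays self-contained), and treating exactly 2 GiB as the cutoff is an assumption based on the approximate ArrayBuffer limit mentioned above.

```dart
/// Hypothetical local stand-in for the two persistent storage modes.
enum StorageChoice { cacheApi, streaming }

/// Assumed cutoff: 2 GiB, approximating the browser ArrayBuffer limit.
const int twoGiB = 2 * 1024 * 1024 * 1024;

/// Picks a storage mode from the model size in bytes.
/// Throws if the model needs streaming but the browser lacks support.
StorageChoice chooseStorage(
  int modelBytes, {
  required bool streamingSupported,
}) {
  if (modelBytes < twoGiB) return StorageChoice.cacheApi;
  if (streamingSupported) return StorageChoice.streaming;
  throw UnsupportedError(
    'Model exceeds ~2 GB and streaming storage is unavailable',
  );
}
```

The result of FlutterGemma.isStreamingSupported() could be passed as streamingSupported, with the returned value then mapped to the corresponding WebStorageMode before calling FlutterGemma.initialize.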

Backend Support

CORS Configuration

  • Required for custom servers: Enable CORS headers on your model hosting server
  • Firebase Storage: See CORS configuration docs
  • HuggingFace: CORS already configured correctly

Memory Limitations

  • Large models: May hit browser memory limits (2GB typical)
  • Recommended: Use smaller models (1B-2B) for web platform
  • Best models for web:
    • Gemma 3 270M (300MB)
    • Gemma 3 1B (500MB-1GB)
    • Gemma3n E2B (3GB) - requires 6GB+ device RAM

Browser Cache Storage Limits

| Browser | Max Model Size | Notes |
|---|---|---|
| Chrome/Firefox | ~2 GB | ArrayBuffer limit |
| Safari | ~50 MB | ⚠️ Not suitable |

Mobile Platform Specifics

Android

  • GPU Support: Requires OpenGL libraries in AndroidManifest.xml
  • ProGuard: Automatic rules included for release builds
  • Storage: Local file system in app documents directory

iOS

  • Minimum version: iOS 16.0 required for MediaPipe GenAI
  • Memory entitlements: Required for large models (see Setup section)
  • Linking: Static linking required (use_frameworks! :linkage => :static)
  • Storage: Local file system in app documents directory
  • Embedding models: Supported via TensorFlowLiteC; no extra Podfile configuration needed

A complete example is available in the example folder.

Important Considerations

  • Model Size: Larger models (such as 7b and 7b-it) might be too resource-intensive for on-device inference.
  • Function Calling Support: Only the models listed under Model Function Calling Support (including Gemma3n and DeepSeek) support function calling. Other models will ignore tools and log a warning.
  • Thinking Mode: DeepSeek and Gemma 4 models support thinking mode (see Feature Comparison). For DeepSeek, enable it with isThinking: true and modelType: ModelType.deepSeek.
  • Multimodal Models: Gemma3n models with vision support require more memory and are recommended for devices with 8GB+ RAM.
  • iOS Memory Requirements: Large models require memory entitlements in Runner.entitlements and minimum iOS 16.0.
  • LoRA Weights: They provide efficient customization without the need for full model retraining.
  • Development vs. Production: For production apps, do not embed the model or LoRA weights within your assets. Instead, load them once and store them securely on the device or via a network drive.
  • Web Models: Currently, Web support is available only for GPU backend models. Multimodal support is fully implemented.
  • Image Formats: The plugin automatically handles common image formats (JPEG, PNG, etc.) when using Message.withImage().

🛟 Troubleshooting

Multimodal Issues:

  • Ensure you're using a multimodal model (Gemma3n E2B/E4B)
  • Set supportImage: true when creating model and chat
  • Check device memory - multimodal models require more RAM

Performance:

  • Use GPU backend for better performance with multimodal models
  • Consider using CPU backend for text-only models on lower-end devices

Memory Issues:

  • iOS: Ensure Runner.entitlements contains memory entitlements (see iOS setup)
  • iOS: Set minimum platform to iOS 16.0 in Podfile
  • Reduce maxTokens if experiencing memory issues
  • Use smaller models (1B-2B parameters) for devices with <6GB RAM
  • Close sessions and models when not needed
  • Monitor token usage with sizeInTokens()

iOS Build Issues:

  • Ensure minimum iOS version is set to 16.0 in Podfile
  • Use static linking: use_frameworks! :linkage => :static
  • Clean and reinstall pods: cd ios && pod install --repo-update
  • Check that all required entitlements are in Runner.entitlements

Advanced Usage

ModelThinkingFilter (Advanced)

For advanced users who need to manually process model responses, the ModelThinkingFilter class provides utilities for cleaning model outputs:

import 'package:flutter_gemma/core/extensions.dart';

// Clean response based on model type
String cleanedResponse = ModelThinkingFilter.cleanResponse(
  rawResponse,
  ModelType.deepSeek
);

// The filter automatically removes model-specific tokens like:
// - <end_of_turn> tags (Gemma models)
// - <think>...</think> blocks (DeepSeek)
// - <|channel>thought\n...<channel|> blocks (Gemma 4 E2B/E4B)
// - Extra whitespace and formatting

This is automatically handled by the chat API, but can be useful for custom inference implementations.
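
For readers writing a custom inference path, the kind of cleanup described above can be approximated with a regular expression. This is a simplified sketch and not the plugin's ModelThinkingFilter: stripThinking is a hypothetical name, and only the <think>...</think> and <end_of_turn> cases are covered, not the Gemma 4 channel markers.

```dart
/// Removes <think>...</think> blocks and <end_of_turn> tags,
/// then trims surrounding whitespace.
String stripThinking(String raw) {
  // dotAll lets `.` match newlines so multi-line reasoning is removed;
  // the lazy `.*?` stops at the first closing tag.
  final thinkBlock = RegExp(r'<think>.*?</think>', dotAll: true);
  return raw
      .replaceAll(thinkBlock, '')
      .replaceAll('<end_of_turn>', '')
      .trim();
}
```

An unterminated <think> block is left untouched by this sketch, so streaming output would need to be buffered until the closing tag arrives.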

☕ Support the Project

If you find Flutter Gemma useful and want to support its development, consider buying me a coffee! Your support helps me:

  • 🔧 Maintain and improve the plugin
  • 📚 Keep documentation up-to-date
  • 🐛 Fix bugs and resolve issues faster
  • ✨ Add new features and model support
  • 🧪 Test on more devices and platforms

ko-fi

Every contribution, no matter how small, makes a difference. Thank you for your support! 💙
