Qwen3: Alibaba Cloud's Advanced Large Language Model Series
Summary
Qwen3 is a powerful series of large language models developed by the Qwen team at Alibaba Cloud. It offers advanced capabilities in reasoning, multilingual support, and long-context understanding, available in various sizes and modes for diverse applications. This repository provides comprehensive resources for running, deploying, and building with Qwen3 models.
Introduction
Qwen3 represents the latest generation of large language models from the Qwen team at Alibaba Cloud. Building on the success of previous iterations, Qwen3 introduces significant enhancements in general capabilities, including instruction following, logical reasoning, text comprehension, mathematics, science, coding, and tool usage. The series features both dense and Mixture-of-Experts (MoE) models, available in various sizes, and supports seamless switching between a dedicated "thinking mode" for complex tasks and a "non-thinking" (instruct) mode for efficient, general-purpose chat. Notably, the Qwen3-2507 models offer enhanced 256K long-context understanding, extendable up to 1 million tokens.
Installation
To get started with Qwen3, the recommended approach is to use the Hugging Face Transformers library. Ensure you have transformers>=4.51.0 installed.
pip install transformers torch
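Since the examples below require transformers>=4.51.0, it can be worth verifying the installed version up front. A minimal sketch (the meets_minimum and check_transformers helpers are illustrative, not part of the Transformers library; the comparison ignores pre-release tags):

```python
from importlib.metadata import PackageNotFoundError, version

def meets_minimum(installed: str, required: str) -> bool:
    """Numerically compare dotted version strings (pre-release tags ignored)."""
    as_tuple = lambda v: tuple(int(p) for p in v.split(".")[:3] if p.isdigit())
    return as_tuple(installed) >= as_tuple(required)

def check_transformers(required: str = "4.51.0") -> None:
    """Exit with a helpful message if transformers is missing or too old."""
    try:
        installed = version("transformers")
    except PackageNotFoundError:
        raise SystemExit("transformers is not installed; run: pip install transformers")
    if not meets_minimum(installed, required):
        raise SystemExit(f"transformers {installed} < {required}; run: pip install -U transformers")
    print(f"transformers {installed} OK")
```

Call check_transformers() once at the top of a script to fail fast with an actionable message instead of a cryptic error later.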
Alternatively, Qwen3 models are well-supported by various local inference frameworks:
- llama.cpp: Requires llama.cpp>=b5401. Follow the instructions in the official documentation for compilation and usage.
- Ollama: Install Ollama (v0.9.0 or higher recommended), run ollama serve, then ollama run qwen3:8b (or other sizes).
- LM Studio: Use Qwen3 GGUF files directly within LM Studio.
- MLX LM: For Apple Silicon users, mlx-lm>=0.24.0 supports Qwen3 models.
- OpenVINO: For Intel CPUs/GPUs, use the OpenVINO toolkit.
Examples
Here are basic examples demonstrating how to use Qwen3 models with Hugging Face Transformers.
Qwen3-Instruct-2507 (Non-Thinking Mode)
from transformers import AutoModelForCausalLM, AutoTokenizer
model_name = "Qwen/Qwen3-30B-A3B-Instruct-2507"
# load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
model_name,
torch_dtype="auto",
device_map="auto"
)
# prepare the model input
prompt = "Give me a short introduction to large language model."
messages = [
{"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True,
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
# conduct text completion
generated_ids = model.generate(
**model_inputs,
max_new_tokens=16384
)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()
content = tokenizer.decode(output_ids, skip_special_tokens=True)
print("content:", content)
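The generate call above uses the library's default sampling settings. The Qwen3 model cards recommend specific sampling parameters for each mode; the values below are a sketch taken from those recommendations, so verify them against the model card of your exact checkpoint before relying on them:

```python
# Suggested sampling settings per the Qwen3 model cards (verify against the
# card for your specific checkpoint; these values are reproduced from memory).
INSTRUCT_SAMPLING = {
    "do_sample": True,
    "temperature": 0.7,
    "top_p": 0.8,
    "top_k": 20,
    "min_p": 0.0,
}

THINKING_SAMPLING = {
    "do_sample": True,
    "temperature": 0.6,
    "top_p": 0.95,
    "top_k": 20,
    "min_p": 0.0,
}

# Usage with the example above:
#     model.generate(**model_inputs, max_new_tokens=16384, **INSTRUCT_SAMPLING)
```

Greedy decoding is generally discouraged for the thinking models, as it can lead to repetition.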
Qwen3-Thinking-2507 (Thinking Mode)
from transformers import AutoModelForCausalLM, AutoTokenizer
model_name = "Qwen/Qwen3-30B-A3B-Thinking-2507"
# load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
model_name,
torch_dtype="auto",
device_map="auto"
)
# prepare the model input
prompt = "Give me a short introduction to large language model."
messages = [
{"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True,
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
# conduct text completion
generated_ids = model.generate(
**model_inputs,
max_new_tokens=32768
)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()
# parsing thinking content
try:
# rindex finding 151668 (</think>)
index = len(output_ids) - output_ids[::-1].index(151668)
except ValueError:
index = 0
thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("\n")
content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")
print("thinking content:", thinking_content) # no opening <think> tag
print("content:", content)
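The </think> parsing logic above can be factored into a small standalone helper. A sketch, reusing token id 151668 (the </think> token from the example; confirm it against your tokenizer):

```python
THINK_END_ID = 151668  # token id of </think> in the example above

def split_thinking(output_ids, think_end_id=THINK_END_ID):
    """Split generated token ids into (thinking_ids, content_ids).

    Finds the last occurrence of the </think> token; if it is absent,
    all tokens are treated as final content.
    """
    try:
        # index just past the last occurrence of </think>
        cut = len(output_ids) - output_ids[::-1].index(think_end_id)
    except ValueError:
        cut = 0
    return output_ids[:cut], output_ids[cut:]
```

Decode each half with tokenizer.decode(..., skip_special_tokens=True), as in the example; skipping special tokens removes the trailing </think> token from the thinking half.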
Why Use It
Qwen3 is a compelling choice for a wide range of AI applications:
- State-of-the-Art Performance: Achieves significant improvements across general capabilities, including logical reasoning, mathematics, science, coding, and tool usage.
- Flexible Architectures: Available in both dense and Mixture-of-Expert (MoE) models, providing options for different performance and efficiency needs.
- Dual Operating Modes: Seamlessly switch between a highly capable "thinking mode" for complex problem-solving and an efficient "instruct mode" for general conversations.
- Extended Context Window: The 2507 models natively handle 256K tokens of context, extendable up to 1 million, enabling deep understanding and generation over ultra-long inputs.
- Multilingual Expertise: Strong capabilities in over 100 languages and dialects, making it suitable for global applications.
- Robust Deployment Options: Supported by popular inference frameworks like SGLang, vLLM, and TensorRT-LLM, facilitating large-scale deployment.
- Open-Source and Community-Driven: Licensed under Apache 2.0, fostering an open environment for development and research.
Links
- GitHub Repository: https://github.com/QwenLM/Qwen3
- Qwen Chat: https://chat.qwen.ai/
- Hugging Face: https://huggingface.co/Qwen
- ModelScope: https://modelscope.cn/organization/qwen
- Paper: https://arxiv.org/abs/2505.09388
- Documentation: https://qwen.readthedocs.io/
- Demo: https://huggingface.co/spaces/Qwen/Qwen3-Demo
- Discord: https://discord.gg/CV4E9rpNSD