Qwen3: Alibaba Cloud's Advanced Large Language Model Series

Summary

Qwen3 is a powerful series of large language models developed by the Qwen team at Alibaba Cloud. It offers advanced reasoning, multilingual support, and long-context understanding, and is available in a range of sizes and operating modes for diverse applications. This repository provides comprehensive resources for running, deploying, and building with Qwen3 models.

Introduction

Qwen3 represents the latest generation of large language models from the Qwen team at Alibaba Cloud. Building on the success of previous iterations, Qwen3 introduces significant enhancements in general capabilities, including instruction following, logical reasoning, text comprehension, mathematics, science, coding, and tool usage. The series features both dense and Mixture-of-Experts (MoE) models in a range of sizes, and supports seamless switching between a dedicated "thinking mode" for complex tasks and a "non-thinking" (instruct) mode for efficient, general-purpose chat. Notably, the Qwen3-2507 models offer enhanced 256K long-context understanding, extendable up to 1 million tokens.
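
For the original (non-2507) Qwen3 checkpoints, this mode switch is exposed through the chat template's enable_thinking flag. A minimal sketch, assuming Qwen/Qwen3-8B as the model (the 2507 releases instead ship as separate Instruct and Thinking variants and do not take this flag):

from transformers import AutoTokenizer

# assumes an original Qwen3 checkpoint such as Qwen/Qwen3-8B;
# the 2507 models ship as separate Instruct and Thinking variants
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")

messages = [{"role": "user", "content": "Solve 2x + 3 = 11."}]

# thinking mode: the prompt elicits a <think>...</think> reasoning block
text_thinking = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=True
)

# non-thinking mode: efficient, general-purpose chat
text_chat = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=False
)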

Installation

To get started with Qwen3, the recommended approach is to use the Hugging Face Transformers library. Ensure you have transformers>=4.51.0 installed; the examples below also use device_map="auto", which requires the accelerate package.

pip install "transformers>=4.51.0" torch accelerate
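
To verify that the installed version meets the minimum:

python -c "import transformers; print(transformers.__version__)"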

Alternatively, Qwen3 models are well-supported by various local inference frameworks:

  • llama.cpp: Requires llama.cpp>=b5401. Follow the instructions in the official documentation for compilation and usage.
  • Ollama: Install Ollama (v0.9.0 or higher recommended) and run ollama serve, then ollama run qwen3:8b (or another size); a minimal API sketch follows this list.
  • LM Studio: Directly use Qwen3 GGUF files within LM Studio.
  • MLX LM: For Apple Silicon users, mlx-lm>=0.24.0 supports Qwen3 models.
  • OpenVINO: For Intel CPU/GPU, use the OpenVINO toolkit.
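
Once ollama serve is running, Qwen3 can also be queried through Ollama's OpenAI-compatible endpoint. A minimal sketch, assuming the default port 11434, the qwen3:8b tag pulled above, and the openai Python package installed:

from openai import OpenAI

# Ollama exposes an OpenAI-compatible API on localhost:11434 by default;
# the api_key value is unused by Ollama but must be non-empty
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

resp = client.chat.completions.create(
    model="qwen3:8b",
    messages=[{"role": "user", "content": "Give me a short introduction to large language models."}],
)
print(resp.choices[0].message.content)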

Examples

Here are basic examples demonstrating how to use Qwen3 models with Hugging Face Transformers.

Qwen3-Instruct-2507 (Non-Thinking Mode)

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-30B-A3B-Instruct-2507"

# load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)

# prepare the model input
prompt = "Give me a short introduction to large language model."
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# conduct text completion
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=16384
)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist() 

content = tokenizer.decode(output_ids, skip_special_tokens=True)

print("content:", content)

Qwen3-Thinking-2507 (Thinking Mode)

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-30B-A3B-Thinking-2507"

# load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)

# prepare the model input
prompt = "Give me a short introduction to large language model."
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# conduct text completion
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=32768
)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist() 

# parse out the thinking content by locating the last </think> token
try:
    # search from the end for token id 151668 (</think>)
    index = len(output_ids) - output_ids[::-1].index(151668)
except ValueError:
    # no </think> token found; treat the entire output as final content
    index = 0

thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("\n")
content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")

print("thinking content:", thinking_content)  # no opening <think> tag
print("content:", content)

Why Use It

Qwen3 offers a compelling solution for various AI applications due to its advanced features:

  • State-of-the-Art Performance: Achieves significant improvements across general capabilities, including logical reasoning, mathematics, science, coding, and tool usage.
  • Flexible Architectures: Available as both dense and Mixture-of-Experts (MoE) models, providing options for different performance and efficiency needs.
  • Dual Operating Modes: Seamlessly switch between a highly capable "thinking mode" for complex problem-solving and an efficient "instruct mode" for general conversations.
  • Extended Context Window: The 2507 models natively handle 256K tokens, extendable up to 1 million, enabling deep understanding and generation over ultra-long inputs.
  • Multilingual Expertise: Strong capabilities in over 100 languages and dialects, making it suitable for global applications.
  • Robust Deployment Options: Supported by popular inference frameworks like SGLang, vLLM, and TensorRT-LLM, facilitating large-scale deployment; see the serving sketch after this list.
  • Open-Source and Community-Driven: Licensed under Apache 2.0, fostering an open environment for development and research.
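
As an illustration of the deployment path, vLLM can serve a Qwen3 model behind an OpenAI-compatible endpoint. A minimal sketch; the port and any additional serve flags are illustrative, so consult the vLLM documentation for current options:

vllm serve Qwen/Qwen3-30B-A3B-Instruct-2507 --port 8000

Any OpenAI-compatible client (here, the openai Python package) can then talk to it:

from openai import OpenAI

# point the client at the local vLLM server started above;
# vLLM does not check the api_key value by default
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="Qwen/Qwen3-30B-A3B-Instruct-2507",
    messages=[{"role": "user", "content": "Give me a short introduction to large language models."}],
)
print(resp.choices[0].message.content)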

Links