Stable Diffusion 3 如何下载安装使用及性能优化_vr

stable diffusion 3

stable diffusion 3（sd3），stability ai最新推出的stable diffusion模型系列，现在可以在hugging face hub上使用，并且可以与diffusers一起使用。

今天发布的模型是stable diffusion 3 medium，包含20亿参数。

下载地址

今天，stable diffusion 3 medium模型正式开源，下载地址：https://huggingface.co/stabilityai/stable-diffusion-3-medium

下载慢的话也可以使用国内网盘下载：
https://pan.quark.cn/s/ce4c98622c96

sd3的新特性？

模型

sd3是一个潜在的扩散模型，由三种不同的文本编码器（clip l/14，openclip bigg/14和t5-v1.1-xxl）、一个新颖的多模态扩散变换器（mmdit）模型和一个与stable diffusion xl中使用的相似的16通道自动编码器模型组成。

sd3将文本输入和像素潜在变量作为一系列嵌入序列处理。位置编码被添加到潜在变量的2x2块上，然后将这些块展平为块编码序列。这个序列连同文本编码序列一起被输入到mmdit块中，它们被嵌入到一个共同的维度，连接起来，并通过一系列调制注意力和多层感知器（mlps）传递。

为了解释两种模态之间的差异，mmdit块使用两组不同的权重将文本和图像序列嵌入到共同的维度。这些序列在注意力操作之前连接，允许两种表示在各自的空间中工作，同时在注意力操作期间考虑另一个。

sd3还利用其clip模型的汇总文本嵌入作为其时间步条件的一部分。这些嵌入首先被连接并添加到时间步嵌入中，然后传递到每个mmdit块。

使用rectified flow matching进行训练

除了架构变化外，sd3应用了一个条件流匹配目标来训练模型。在这种方法中，前向噪声过程被定义为一个连接数据和噪声分布的直线的整流流。

整流流匹配采样过程更简单，并且在减少采样步骤数量时表现良好。为了支持sd3的推理，我们引入了一个新的调度器（flowmatcheulerdiscretescheduler），它具有整流流匹配公式和欧拉方法步骤。它还通过一个shift参数实现了时间步调度的分辨率依赖性偏移。增加shift值可以更好地处理更高分辨率的噪声缩放。建议对20亿模型使用shift=3.0。

要快速尝试sd3，请参考下面的应用程序：

使用diffusers与sd3

要使用diffusers与sd3，确保升级到最新的diffusers版本。

pip install --upgrade diffusers

由于模型是受限制的，在使用diffusers之前，您需要先访问hugging face页面上的stable diffusion 3 medium页面，填写表单并接受限制。一旦您进入，您需要登录，以便您的系统知道您已经接受了限制。使用以下命令登录：

下面的代码片段将下载sd3的20亿参数版本，精度为fp16。这是stability ai发布的原始检查点中使用的格式，也是推荐运行推理的方式。

文本到图像

import torch
from diffusers import stablediffusion3pipeline

pipe = stablediffusion3pipeline.from_pretrained("stabilityai/stable-diffusion-3-medium-diffusers", torch_dtype=torch.float16)
pipe = pipe.to("cuda")

image = pipe(
    "a cat holding a sign that says hello world",
    negative_prompt="",
    num_inference_steps=28,
    guidance_scale=7.0,
).images[0]
image

图像到图像

import torch
from diffusers import stablediffusion3img2imgpipeline
from diffusers.utils import load_image

pipe = stablediffusion3img2imgpipeline.from_pretrained("stabilityai/stable-diffusion-3-medium-diffusers", torch_dtype=torch.float16)
pipe = pipe.to("cuda")

init_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/cat.png")
prompt = "cat wizard, gandalf, lord of the rings, detailed, fantasy, cute, adorable, pixar, disney, 8k"
image = pipe(prompt, image=init_image).images[0]
image

您可以在这里查看sd3的文档。

sd3的内存优化

sd3使用三个文本编码器，其中一个是非常大尺寸的t5-xxl模型。这使得即使使用fp16精度，在少于24gb vram的gpu上运行模型也具有挑战性。

为了解决这个问题，diffusers集成提供了内存优化，允许sd3在更广泛的设备上运行。

使用模型卸载进行推理

diffusers中最基本的内存优化允许您在推理期间将模型组件卸载到gpu，以节省内存，同时看到推理延迟的轻微增加。模型卸载只会在需要执行时将模型组件移动到gpu，同时保持其余组件在cpu上。

import torch
from diffusers import stablediffusion3pipeline

pipe = stablediffusion3pipeline.from_pretrained("stabilityai/stable-diffusion-3-medium-diffusers", torch_dtype=torch.float16)
pipe.enable_model_cpu_offload()

prompt = "smiling cartoon dog sits at a table, coffee mug on hand, as a room goes up in flames. 'this is fine,' the dog assures himself."
image = pipe(prompt).images[0]

t5-xxl文本编码器

在推理期间移除内存密集型的47亿参数t5-xxl文本编码器可以显著降低sd3的内存需求，只有轻微的性能损失。

import torch
from diffusers import stablediffusion3pipeline

pipe = stablediffusion3pipeline.from_pretrained("stabilityai/stable-diffusion-3-medium-diffusers", text_encoder_3=none, tokenizer_3=none, torch_dtype=torch.float16)
pipe = pipe.to("cuda")

prompt = "smiling cartoon dog sits at a table, coffee mug on hand, as a room goes up in flames. 'this is fine,' the dog assures himself."
image = pipe("").images[0]

使用t5-xxl模型的量化版本

您可以使用bitsandbytes库以8位加载t5-xxl模型，以进一步降低内存需求。

import torch
from diffusers import stablediffusion3pipeline
from transformers import t5encodermodel, bitsandbytesconfig

# 确保您已经安装了`bitsandbytes`。
quantization_config = bitsandbytesconfig(load_in_8bit=true)

model_id = "stabilityai/stable-diffusion-3-medium-diffusers"
text_encoder = t5encodermodel.from_pretrained(
    model_id,
    subfolder="text_encoder_3",
    quantization_config=quantization_config,
)
pipe = stablediffusion3pipeline.from_pretrained(
    model_id,
    text_encoder_3=text_encoder,
    device_map="balanced",
    torch_dtype=torch.float16
)

您可以在这里找到完整的代码片段。

内存优化总结

所有基准测试都是在80gb vram的a100 gpu上使用2b版本的sd3模型进行的，使用fp16精度和pytorch 2.3。

我们运行了10次管道推理调用，并测量了管道的平均峰值内存使用量和执行20次扩散步骤所需的平均时间。

sd3的性能优化

为了提高推理延迟，我们可以使用torch.compile()来获得vae和transformer组件的优化计算图。

import torch
from diffusers import stablediffusion3pipeline

torch.set_float32_matmul_precision("high")
torch._inductor.config.conv_1x1_as_mm = true
torch._inductor.config.coordinate_descent_tuning = true
torch._inductor.config.epilogue_fusion = false
torch._inductor.config.coordinate_descent_check_all_directions = true

pipe = stablediffusion3pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3-medium-diffusers",
    torch_dtype=torch.float16
).to("cuda")
pipe.set_progress_bar_config(disable=true)

pipe.transformer.to(memory_format=torch.channels_last)
pipe.vae.to(memory_format=torch.channels_last)

pipe.transformer = torch.compile(pipe.transformer, mode="max-autotune", fullgraph=true)
pipe.vae.decode = torch.compile(pipe.vae.decode, mode="max-autotune", fullgraph=true)

# 预热
prompt = "a photo of a cat holding a sign that says hello world",
for _ in range(3):
    _ = pipe(prompt=prompt, generator=torch.manual_seed(1))

#