原文：Leveraging Llama 2 Features in Real-world Applications: Building Scalable Chatbots with FastAPI, Celery, Redis, and Docker

深入探索:开源与闭源大语言模型，剖析Llama2的独特功能，掌握prompt工程的艺术，并设计与FastAPI，Celery，Redis和Docker健壮的解决方案。

概览

出乎意料的是，Meta在几天前开源了他们的大型语言模型(LLM) Llama 2，这一决定可能会重塑当前人工智能发展的格局。它为OpenAI和谷歌等领域的主要公司提供了另一种选择，这些公司决定严格控制自己的人工智能模型，限制可访问性，限制更广泛的创新。希望Meta的决定能够激发开源社区的集体反应，从而抵消限制访问该领域进展的趋势。Llama2的新授权甚至更进一步，允许商业使用，允许开发人员和企业在现有和新产品中利用该模型。

Llama2系列由预训练和微调的大语言模型组成，包括Llama2和Llama2- chat，可扩展到70B参数。这些模型已被证明在各种基准测试中比开源模型表现得更好 ~~[1]~~ 。他们还坚持反对一些闭源模型，为开源人工智能的发展提供了急需的推动力~~[2]~~。

如果你关注HuggingFace的Open LLM排行榜，你会发现Meta的Llama 2占据了第三的位置。在Llama 2发布之后，Stability AI发布了FreeWilly1和FreeWilly2~~[3]~~。FreeWilly1是Llama的一个微调版本，而FreeWilly2是Llama 2的一个微调版本。Stability AI分享说，他们在orca风格的数据集上对这两个模型进行了微调。Orca数据集是一个大型的、结构化的增强数据集，旨在对大语言模型进行微调，其中每个条目由一个问题和GPT-4或GPT-3.5的相应响应组成。为什么我们不使用FreeWilly2模式?不幸的是，虽然Llama2允许商业用途，但FreeWilly2只能用于研究目的，受非商业知识共享许可(CC BY-NC-4.0)的约束。

在本文中，我们还将通过使用FastAPI、Celery、Redis和Docker以及Meta的Llama 2来构建一个强大且可扩展的聊天应用程序。我们的目标是创建一个高效、实时的应用程序，它可以处理多个并发用户请求，并将LLM的响应处理转移到任务队列中。它允许应用程序保持响应性，我们可以使用Redis有效地管理任务。最后，我们将介绍Docker的部署和扩展。应用程序应该演示这些技术如何协同工作以提供大规模的良好聊天体验，展示像Llama 2这样的开源语言模型在商业环境中的潜力。所以让我们潜入并开始构建吧!

开源vs闭源

我们目睹了公司和研究小组几乎每周发布新模型，或开源或闭源。因此，谁将赢得人工智能军备竞赛?为了给出一个有根据的猜测，我们需要了解这些模型训练过程的几个方面。

研究人员在广泛的自监督数据上使用自回归transformers作为起点。让我们先来分析一下什么是自回归transformers和自监督数据。自回归transformers是transformers模型的一种变体，广泛用于涉及顺序数据的任务，特别是在自然语言处理(NLP)中。这些模型以自回归的方式生成序列，即:它们一次生成序列的一部分，并使用它们之前的输出作为后续步骤的输入。这使得它们特别擅长语言翻译、文本生成等任务，在这些任务中，前一个数据点的上下文会影响下一个数据点的预测。自监督学习是一种由输入数据本身提供训练标签的学习方法。它通过学习预测其他数据的某些部分，消除了明确的手动标记的需要，并允许探索大量未标记的数据。

下一步，研究人员通常会使用强化学习与人类反馈(RLHF)等技术来训练模型，使其与人类偏好保持一致。在RLHF中，人工智能系统从基于其做出的决策的反馈中学习。它包括创建一个奖励模型，AI系统使用该模型来学习哪些行为会导致积极和消极的结果。其目的是使人工智能系统的行为与人类的价值观和偏好保持一致。

那么，开源社区面临的主要挑战是什么呢?

这两个步骤都需要强大的计算能力。其次，公司在校准步骤中使用他们的专有数据来微调他们的模型，大大提高了他们的可用性和安全性。

Llama2系列模型

Llama2是Llama1的高级版本，基于一种新的公共可用数据组合进行训练。关键的改进包括预训练语料库大小增加了40%，模型的上下文长度增加了一倍，并采用分组查询关注来提高大型模型的推理可扩展性。分组查询注意是对transformers模型中标准注意机制的改进，用于降低计算成本。分组查询注意不是为每对输入和输出位置计算注意分数，这可能是资源密集型的，而是将查询分成组并一起处理它们。该方法保留了标准注意力的大部分有效性，同时通过降低计算复杂度来处理更长的序列或更大的模型。

训练语料库由来自公开来源的新混合数据组成(没有使用Meta产品或服务的数据)。此外，还努力消除已知含有大量个人信息的网站上的数据。训练数据包含2万亿个tokens，研究团队决定对最真实的来源进行上采样，以提高知识的准确性。

Llama2的改型采用7B、13B和70B参数现在可用。Llama2- chat是Llama2的对话优化、微调版本，也提供7B、13B和70B参数。

Llama 2的Prompt工程

Prompt工程帮助我们引导大语言模型以某种方式行事，其中包括Llama 2。在Llama 2的上下文中，prompt指的是给出给模型的初始指令或查询，然后模型使用这些指令或查询来生成响应。然而，在Llama 2中，prompts可以非常详细，并且可以包含设置模型的上下文或“个性”的系统消息。

Llama2使用独特的prompt格式启动对话。它看起来是这样的:

该模板与模型的训练过程一致，因此对输出的质量影响很大。在这个模板中，system_prompt表示模型的指令或上下文。

这里有一个例子:

system_prompt 提供了模型的一般说明，它将指导它的所有响应。用户的消息遵循系统prompt并从模型中寻求特定的响应。

在多回合对话中，用户和机器人之间的所有交互都附加到前面的prompt符中，并包含在[INST]标签中。它看起来是这样的:

每个新的用户消息和模型响应都被添加到现有会话中，从而保留上下文。

需要注意的是，与许多AI模型一样，Llama2是无状态的，不会“记住”之前的对话。因此，有必要在每次prompt模型时提供整个上下文。这就是为什么Meta致力于增加Llama2的上下文窗口的原因。

最后，prompt工程与其说是一门科学，不如说是一门艺术。掌握它的最好方法是通过不断的测试和完善。创造性地使用prompts，尝试不同的格式和说明。此外，不同的大语言模型受益于不同类型的prompts。

解决方案架构设计:FastAPI、Celery、Redis和Docker

在本系列中，我们一直使用FastAPI来构建我们的ML应用程序。它是一个用于构建api的高性能web框架。在这种情况下，它的异步功能使其能够并发地处理多个请求，这对于实时聊天应用程序是至关重要的。

除了FastAPI，我们还使用Celery作为分布式任务队列，以帮助管理从LLM生成响应的计算密集型任务。通过将此进程卸载到任务队列，应用程序在处理其他用户请求的同时保持对新用户请求的响应，从而确保用户不会等待。由于我们使用的是分布式任务队列，因此需要一个消息代理来帮助异步任务处理。我们选择Redis来完成这项工作。它将FastAPI中的任务排在队列中，等待Celery进行处理，从而实现高效、解耦的通信。此外，Redis的内存数据结构存储速度很快，允许实时分析、会话缓存和维护用户会话数据。

遵循最佳实践，我们使用Docker将应用程序及其依赖封装到独立的容器中，这样我们就可以轻松地将其部署到各种环境中。

用Llama2、FastAPI、Redis和Celery构建一个聊天API

本指南解释了如何使用Llama2与FastAPI, Redis和Celery建立一个应用程序。我们将介绍这些概念以及它们是如何协同工作的。在我们的架构中，FastAPI用于创建一个接受传入请求的web服务器，Celery用于管理异步任务，Redis作为Celery的代理和后端，存储任务及其结果。

APPLICATION

FastAPI应用程序(app.py)由用于生成文本和获取任务结果的端点组成。/generate/端点接受带有prompt符作为输入的POST请求，并返回任务ID。它使用Celery任务generate_text_task异步启动任务。/task/{task_id}端点通过任务ID获取任务的状态/结果。

from fastapi import FastAPI
from pydantic import BaseModel
from celery.result import AsyncResult
from typing import Any
from celery_worker import generate_text_task
from dotenv import load_dotenv

load_dotenv()

app = FastAPI()


class Item(BaseModel):
    prompt: str


@app.post("/generate/")
async def generate_text(item: Item) -> Any:
    task = generate_text_task.delay(item.prompt)
    return {"task_id": task.id}


@app.get("/task/{task_id}")
async def get_task(task_id: str) -> Any:
    result = AsyncResult(task_id)
    if result.ready():
        res = result.get()
        return {"result": res[0],
                "time": res[1],
                "memory": res[2]}
    else:
        return {"status": "Task not completed yet"}

Workers

Celery工作器(celery_worker.py)文件创建一个Celery实例并定义generate_text_task函数。该函数接受prompt并使用Llama 2模型生成文本。这个函数用@celery.task decorator注册为一个Celery任务。

setup_model函数是一个worker初始化函数。它在工作进程启动时设置模型加载器。这个函数被注册为使用@signals.worker_process_init.connect decorator在工作进程初始化事件中调用。

from celery import Celery, signals
from utils import generate_output
from model_loader import ModelLoader


def make_celery(app_name=__name__):
    backend = broker = 'redis://llama2_redis_1:6379/0'
    return Celery(app_name, backend=backend, broker=broker)


celery = make_celery()

model_loader = None
model_path = "meta-llama/Llama-2-7b-chat-hf"


@signals.worker_process_init.connect
def setup_model(signal, sender, **kwargs):
    global model_loader
    model_loader = ModelLoader(model_path)


@celery.task
def generate_text_task(prompt):
    time, memory, outputs = generate_output(
        prompt, model_loader.model, model_loader.tokenizer
    )
    return model_loader.tokenizer.decode(outputs[0]), time, memory

Model

model_loader.py中的ModelLoader类负责从给定的模型路径加载Llama 2模型。它使用HuggingFace的transformer库来加载模型及其标记器。

import os
from transformers import AutoModelForCausalLM, AutoConfig, AutoTokenizer
from dotenv import load_dotenv

load_dotenv()


class ModelLoader:
    def __init__(self, model_path: str):
        self.model_path = model_path
        self.config = AutoConfig.from_pretrained(
            self.model_path,
            trust_remote_code=True,
            use_auth_token=os.getenv("HUGGINGFACE_TOKEN"),
        )
        self.model = self._load_model()
        self.tokenizer = AutoTokenizer.from_pretrained(
            self.model_path, use_auth_token=os.getenv("HUGGINGFACE_TOKEN")
        )

    def _load_model(self):
        model = AutoModelForCausalLM.from_pretrained(
            self.model_path,
            config=self.config,
            trust_remote_code=True,
            load_in_4bit=True,
            device_map="auto",
            use_auth_token=os.getenv("HUGGINGFACE_TOKEN"),
        )
        return model

Broker

要设置Redis，我们有两个选择:我们可以使用docker容器，或者我们可以使用Python包redis_server。如果您决定使用docker容器(首选解决方案)，您可以运行下面的命令。-p 6379:6379选项告诉Docker将从主机端口6379传入的流量转发到容器端口6379。这样，Redis实际上可以从docker容器外部访问。

1	docker run --name redis-db -p 6379:6379 -d redis

第二种选择是从Python接口执行。redis_server.py脚本处理Redis服务器的安装和启动。回想一下，Redis同时充当Celery的消息代理和结果后端。

import subprocess
import redis_server


def install_redis_server(redis_version):
    try:
        subprocess.check_call(["pip", "install", f"redis-server=={redis_version}"])
        print(f"Redis server version {redis_version} installed successfully.")
    except subprocess.CalledProcessError:
        print("Failed to install Redis server.")
        exit(1)


def start_redis_server():
    try:
        redis_server_path = redis_server.REDIS_SERVER_PATH
        subprocess.Popen([redis_server_path])
        print("Redis server started successfully.")
    except Exception as e:
        print("Failed to start Redis server:", str(e))
        exit(1)


def main():
    redis_version = "6.0.9"
    install_redis_server(redis_version)
    start_redis_server()


if __name__ == "__main__":
    main()

运行application

主执行脚本(run.py)是与FastAPI应用程序通信的客户端脚本。它向/generate/端点发送一个prompt，获取任务ID，并定期轮询/task/{task_id}端点，直到任务完成。

import http.client
import json
import time

API_HOST = "localhost"
API_PORT = 8000


def generate_text(prompt):
    conn = http.client.HTTPConnection(API_HOST, API_PORT)
    headers = {"Content-type": "application/json"}
    data = {"prompt": prompt}
    json_data = json.dumps(data)
    conn.request("POST", "/generate/", json_data, headers)
    response = conn.getresponse()
    result = json.loads(response.read().decode())
    conn.close()
    return result["task_id"]


def get_task_status(task_id):
    conn = http.client.HTTPConnection(API_HOST, API_PORT)
    conn.request("GET", f"/task/{task_id}")
    response = conn.getresponse()
    status = response.read().decode()
    conn.close()
    return status


def main():
    prompt = input("Enter the prompt: ")

    task_id = generate_text(prompt)
    while True:
        status = get_task_status(task_id)
        if "Task not completed yet" not in status:
            print(status)
            break
        time.sleep(2)


if __name__ == "__main__":
    main()

utils模块(utils.py)提供了一个实用程序函数generate_output，用于使用Llama 2模型和标记器从prompt生成文本。该函数使用@time_decorator和@memory_decorator进行装饰，以测量执行时间和内存使用情况。

import time
import torch
import functools
from transformers import AutoModelForCausalLM, AutoTokenizer


def time_decorator(func):
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start_time = time.time()
        result = func(*args, **kwargs)
        end_time = time.time()
        exec_time = end_time - start_time
        return (result, exec_time)
    return wrapper


def memory_decorator(func):
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        torch.cuda.empty_cache()
        torch.cuda.reset_peak_memory_stats()
        result, exec_time = func(*args, **kwargs)
        peak_mem = torch.cuda.max_memory_allocated()
        peak_mem_consumption = peak_mem / 1e9
        return peak_mem_consumption, exec_time, result
    return wrapper


@memory_decorator
@time_decorator
def generate_output(prompt: str, model: AutoModelForCausalLM, tokenizer: AutoTokenizer) -> torch.Tensor:
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    input_ids = input_ids.to("cuda")
    outputs = model.generate(input_ids, max_length=500)
    return outputs

本质上，当通过/generate/端点接收到prompt时，它会作为异步任务转发给Celery工作器。worker使用Llama 2模型生成文本，并将结果存储在Redis中。您可以在任何时候使用/task/{task_id}端点获取任务状态/结果。

发布

部署应用程序需要执行几个步骤。首先，让我们为我们的应用创建一个Dockerfile:

FROM python:3.9-slim-buster

WORKDIR /app
ADD . /app

RUN pip install --no-cache-dir -r requirements.txt
EXPOSE 80

CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "80"]

接下来，让我们定义所需的requirements.txt，以便我们在Docker容器中安装所有的依赖项:

fastapi==0.99.1
uvicorn==0.22.0
pydantic==1.10.10
celery==5.3.1
redis==4.6.0
python-dotenv==1.0.0
transformers==4.30.2
torch==2.0.1
accelerate==0.21.0
bitsandbytes==0.41.0
scipy==1.11.1

为了使用Docker Compose来设置FastAPI应用程序、Celery和Redis服务器，你可以创建一个Docker - Compose.yml文件如下:

version: '3'
services:
  web:
    build: .
    command: ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "80"]
    volumes:
      - .:/app
    ports:
      - 8000:80
    depends_on:
      - redis
  worker:
    build: .
    command: celery -A celery_worker worker --loglevel=info
    volumes:
      - .:/app
    depends_on:
      - redis
  redis:
    image: "redis:alpine"
    ports:
      - 6379:6379

在Docker Compose配置中，web服务代表使用当前目录下的Dockerfile构建的FastAPI应用程序。它将主机的端口8000映射到容器的端口80。worker服务是Celery，它与FastAPI应用程序共享构建上下文。redis服务使用官方的redis Docker镜像。depends_on 字段确保Redis在web和worker服务之前启动。所有这些服务都可以使用docker-compose up 命令启动。

结果

我们测试了Llama2，聊天版本使用4位量化的7b参数。

我们使用了以下输入:

Prompt: <>\nYou are J. Robert Oppenheimer, a brilliant physicist whose pioneering work during the 20th century significantly contributed to the development of the atomic bomb. Dive into the profound world of nuclear physics, challenge the boundaries of scientific understanding, and unlock the mysteries of atomic energy with your exceptional intellect. Embark on a momentous journey where your passion for science knows no limits, and let your dedication to harnessing the power of the atom shape the course of history and leave an indelible mark on the world.\n<>\n[INST]\nUser: How did your research lead to the creation of the atomic bomb?\n[/INST]

Llama2生成的最终对话如下:

Ah, a most excellent question, my dear fellow! adjusts spectacles\nThe creation of the atomic bomb, you see, was the culmination of a most intense and complex scientific journey. It was a quest that spanned decades, involved some of the brightest minds in physics, and pushed the boundaries of human understanding to their very limits.\nYou see, my dear, the quest for the atomic bomb began in the early 1930s, when a group of us at the University of California, Berkeley, realized that the nucleus of an atom could be split, releasing an enormous amount of energy in the process. This was a most remarkable discovery, as it revealed the incredible potential of the atom as a source of power.\nBut, alas, the path to the atomic bomb was not without its challenges. Oh, the obstacles we faced! winks The skeptics, the naysayers, the doubters. They were legion, I tell you! chuckles But, we persevered, my dear. We worked tirelessly, pouring over the data, testing and retesting our theories. And, eventually, our hard work paid off.\nIn 1945, our team successfully detonated the first atomic bomb at the Trinity Site in New Mexico. It was a moment of pure elation, a culmination of years of hard work and dedication. And, as the world watched in awe, we realized the incredible potential of the atomic bomb.\nBut, my dear, the creation of the atomic bomb was…

文本在4.77秒内生成，峰值VRAM使用量约为13.4 GB。需要始终牢记的一件事是，由于我们使用的是Transformer模型，因此内存需求随着生成的输出序列的大小而增加。在本例中，我们将max_length定义为500个tokens。

这些结果表明，Llama2模型可以在合理的时间范围内产生复杂而微妙的响应。请记住，我们使用的是最小的模型(7b)和更重要的量化(4位)。

结论

正如我们在本文中所展示的，Meta的Llama 2模型为开源社区提供了新的可能性。我们介绍了Llama2的一些关键特征和特点，包括它的训练过程、架构和prompt工程设计。

此外，我们还提供了一个关于使用FastAPI, Redis和Celery构建Llama2聊天应用程序的深入指南。它应该允许任何人开始构建可扩展的实时应用程序，为数千名用户提供Llama 2(或任何其他商业许可的LLM)服务。

在我们的结果中，我们展示了该模型在为复杂的prompts生成详细且上下文丰富的响应方面的性能。

我的学习笔记

分析Llama2特性，利用FastAPI、Celery、Redis和Docker实现可扩展的聊天机器人

概览