We're teaching AI to understand and simulate the physical world in motion, with the goal of training models that help people solve problems requiring real-world interaction.
Introducing Sora, our text-to-video model. Sora can generate videos up to a minute long while maintaining visual quality and adherence to the user's prompt.
Today, Sora is available to red teamers to assess critical areas for harms or risks. We are also granting access to a number of visual artists, designers, and filmmakers to gain feedback on how to advance the model to be most helpful for creative professionals.
We're sharing our research progress early to start working with and getting feedback from people outside of OpenAI, and to give the public a sense of what AI capabilities are on the horizon.
Sora is able to generate complex scenes with multiple characters, specific types of motion, and accurate details of the subject and background. The model understands not only what the user has asked for in the prompt, but also how those things exist in the physical world.
The model has a deep understanding of language, enabling it to accurately interpret prompts and generate compelling characters that express vibrant emotions. Sora can also create multiple shots within a single generated video that accurately persist characters and visual style.
The current model has weaknesses. It may struggle to accurately simulate the physics of a complex scene, and may not understand specific instances of cause and effect. For example, a person might take a bite out of a cookie, but afterward, the cookie may not have a bite mark.
The model may also confuse spatial details of a prompt, for example mixing up left and right, and may struggle with precise descriptions of events that unfold over time, like following a specific camera trajectory.
Safety
We’ll be taking several important safety steps ahead of making Sora available in OpenAI’s products. We are working with red teamers — domain experts in areas like misinformation, hateful content, and bias — who will be adversarially testing the model.
We’re also building tools to help detect misleading content such as a detection classifier that can tell when a video was generated by Sora. We plan to include C2PA metadata in the future if we deploy the model in an OpenAI product.
In addition to us developing new techniques to prepare for deployment, we’re leveraging the existing safety methods that we built for our products that use DALL·E 3, which are applicable to Sora as well.
For example, once in an OpenAI product, our text classifier will check and reject text input prompts that are in violation of our usage policies, like those that request extreme violence, sexual content, hateful imagery, celebrity likeness, or the IP of others. We’ve also developed robust image classifiers that are used to review the frames of every video generated to help ensure that it adheres to our usage policies, before it’s shown to the user.
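As a rough illustration of prompt-level filtering, the sketch below rejects prompts matching policy topics. A production system would use a trained classifier over the full usage policies; the topic list and function here are purely hypothetical.

```python
# Hypothetical policy topics for illustration only; a real system would
# use a trained text classifier, not substring matching.
BLOCKED_TOPICS = ("extreme violence", "sexual content", "hateful imagery")

def check_prompt(prompt: str) -> bool:
    """Return True if the prompt passes the (toy) policy check."""
    lowered = prompt.lower()
    return not any(topic in lowered for topic in BLOCKED_TOPICS)

print(check_prompt("a corgi surfing a wave at sunset"))  # True
print(check_prompt("a scene of extreme violence"))       # False
```

The same gate-before-output pattern applies on the image side: each generated frame would pass through a classifier before the video is shown to the user.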
We’ll be engaging policymakers, educators and artists around the world to understand their concerns and to identify positive use cases for this new technology. Despite extensive research and testing, we cannot predict all of the beneficial ways people will use our technology, nor all the ways people will abuse it. That’s why we believe that learning from real-world use is a critical component of creating and releasing increasingly safe AI systems over time.
Techniques
Sora is a diffusion model, which generates a video by starting off with one that looks like static noise and gradually transforms it by removing the noise over many steps.
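The denoising loop can be sketched with a toy stand-in for the learned model. The "denoiser" below simply shrinks the sample toward a fixed target over many steps; a real diffusion model would instead use a neural network to predict and subtract the noise at each step.

```python
import numpy as np

def toy_denoise_step(x, t, num_steps):
    """Stand-in for a learned denoiser: nudges the sample toward a
    fixed target to illustrate iterative noise removal. A real model
    predicts the noise with a neural network instead."""
    target = np.zeros_like(x)       # pretend the "clean video" is all zeros
    alpha = 1.0 / (num_steps - t)   # remove a growing fraction of the gap
    return x + alpha * (target - x)

def generate(shape=(4, 8, 8), num_steps=50, seed=0):
    """Start from pure Gaussian noise and progressively denoise it."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(shape)  # frames x height x width of static
    for t in range(num_steps):
        x = toy_denoise_step(x, t, num_steps)
    return x

video = generate()
print(float(np.abs(video).max()))  # close to 0: the noise has been removed
```

The essential structure is the same as in the real model: begin with static noise and apply many small denoising steps until a coherent sample remains.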
Sora is capable of generating entire videos all at once or extending generated videos to make them longer. By giving the model foresight of many frames at a time, we’ve solved a challenging problem of making sure a subject stays the same even when it goes out of view temporarily.
Similar to GPT models, Sora uses a transformer architecture, unlocking superior scaling performance.
We represent videos and images as collections of smaller units of data called patches, each of which is akin to a token in GPT. By unifying how we represent data, we can train diffusion transformers on a wider range of visual data than was possible before, spanning different durations, resolutions and aspect ratios.
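The patch representation can be illustrated with a small sketch: a video tensor is cut into non-overlapping spacetime patches, and each patch is flattened into a vector, playing the role a token plays in GPT. The patch sizes and array layout below are assumptions for illustration, not Sora's actual configuration.

```python
import numpy as np

def video_to_patches(video, pt=2, ph=4, pw=4):
    """Split a video of shape (T, H, W, C) into non-overlapping
    spacetime patches of shape (pt, ph, pw, C), flattening each
    into one vector (one "token"). Patch sizes are illustrative."""
    T, H, W, C = video.shape
    assert T % pt == 0 and H % ph == 0 and W % pw == 0
    x = video.reshape(T // pt, pt, H // ph, ph, W // pw, pw, C)
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)    # group the patch axes together
    return x.reshape(-1, pt * ph * pw * C)  # (num_patches, patch_dim)

video = np.zeros((8, 16, 16, 3))            # 8 frames of 16x16 RGB
tokens = video_to_patches(video)
print(tokens.shape)  # (64, 96): 4*4*4 patches, each 2*4*4*3 values long
```

Because any duration, resolution, or aspect ratio reduces to the same kind of token sequence, one transformer can be trained across heterogeneous visual data.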
Sora builds on past research in DALL·E and GPT models. It uses the recaptioning technique from DALL·E 3, which involves generating highly descriptive captions for the visual training data. As a result, the model is able to follow the user’s text instructions in the generated video more faithfully.
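The recaptioning idea, replacing terse labels with rich, model-generated captions before training, can be sketched as a simple data pipeline. Here `toy_captioner` is a stand-in for a real learned captioning model, and the data layout is hypothetical.

```python
def recaption(videos, captioner):
    """Pair each training video with a highly descriptive caption
    produced by a captioning model, instead of its original terse
    label. `captioner` is a stand-in for a learned model."""
    return [(video, captioner(video)) for video in videos]

def toy_captioner(video):
    # Placeholder: elaborates on a stored short label. A real
    # captioner would describe the pixels themselves in detail.
    return f"A detailed shot of {video['label']}, with lighting and motion described at length."

videos = [{"label": "a dog"}, {"label": "a city street"}]
dataset = recaption(videos, toy_captioner)
print(dataset[0][1].startswith("A detailed shot"))  # True
```

Training on these richer video-caption pairs is what lets the model follow detailed text instructions more faithfully at generation time.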
In addition to being able to generate a video solely from text instructions, the model is able to take an existing still image and generate a video from it, animating the image’s contents with accuracy and attention to small detail. The model can also take an existing video and extend it or fill in missing frames. Learn more in our technical report.
Sora serves as a foundation for models that can understand and simulate the real world, a capability we believe will be an important milestone for achieving AGI.