Preface

GPT is a series of pre-training papers from OpenAI. GPT is short for Generative Pre-Trained Transformer; as the name suggests, its goal is to obtain a general-purpose text model by applying pre-training techniques on top of the Transformer architecture. The papers published so far include the text pre-training models GPT-1, GPT-2, and GPT-3, as well as the image pre-training model iGPT. The as-yet-unreleased GPT-4 is rumored to be a multimodal model. The recently popular ChatGPT and the InstructGPT announced earlier this year [1] are sister models: warm-up models released ahead of GPT-4, sometimes referred to as GPT-3.5. ChatGPT and InstructGPT are identical in model structure and training method; both use instruction learning and reinforcement learning from human feedback (RLHF) to guide training, and they differ only in how their data was collected. So to understand ChatGPT, we must first understand InstructGPT.

1. Background knowledge

Before introducing ChatGPT/InstructGPT, let's first introduce the basic algorithms they rely on.

1.1 The GPT series
The text pre-training models GPT-1 [2], GPT-2 [3], and GPT-3 [4] all take the Transformer as their core structure (Figure 1). They differ in hyperparameters such as the number of layers and the word-vector length; the specifics are shown in Table 1.
Figure 1: Model structure of the GPT series (where Trm is a Transformer block).

Table 1: Release date, architecture, parameter count, and pre-training data volume of the GPT series.

Model | Release | Layers | Heads | Word-vector length | Parameters | Pre-training data
GPT-1 | June 2018 | 12 | 12 | 768 | 117M | ~5 GB
GPT-2 | February 2019 | 48 | - | 1600 | 1.5B | 40 GB
GPT-3 | May 2020 | 96 | 96 | 12288 | 175B | 45 TB

GPT-1 was born a few months earlier than BERT. Both take the Transformer as their core structure, but they differ in that GPT-1 builds its pre-training task in a left-to-right generative fashion, thereby obtaining a general-purpose pre-trained model which, like BERT, can be fine-tuned directly on downstream tasks. GPT-1 achieved SOTA on 9 NLP tasks at the time, but the model size and data volume it used were relatively small, which prompted the birth of GPT-2.
Compared with GPT-1, GPT-2 did not make much of its model structure; it simply used a model with more parameters and more training data (Table 1). GPT-2's most important idea is that "all supervised learning is a subset of unsupervised language modeling", an idea that is also the precursor of prompt learning.
At its birth, GPT-2 also caused quite a sensation: the news articles it generated were convincing enough to deceive most humans, to the point of passing for the real thing. It was even called "the most dangerous weapon in the AI world" at the time, and many portal sites banned the use of GPT-2-generated news. When GPT-3 was proposed, beyond its performance far surpassing GPT-2, what sparked even more discussion was its 175 billion parameters. Besides completing common NLP tasks, researchers unexpectedly discovered that GPT-3 also performs quite well at writing code in languages such as SQL and JavaScript and at simple mathematical operations. GPT-3's training used in-context learning, a form of meta-learning. The core idea of meta-learning is to search, using a small amount of data, for a suitable initialization that lets the model fit quickly on a limited dataset and obtain good results.
From the analysis above, we can see that, from a performance perspective, GPT has two goals: to improve the model's performance on common NLP tasks, and to enhance its generalization to atypical NLP tasks (such as writing code and mathematical operations). Additionally, ever since the birth of pre-trained models, one heavily criticized issue has been their bias. Because pre-trained models are trained with extremely large parameter counts on massive data, they are, compared with expert systems fully controlled by hand-written rules, like a black box. No one can guarantee that a pre-trained model will not generate dangerous content involving, say, racial or gender discrimination, because its tens of GB or even tens of TB of training data almost certainly contain similar training samples. This is the motivation behind the proposal of InstructGPT and ChatGPT, whose optimization objectives the paper summarizes with 3H: Helpful, Honest, and Harmless. OpenAI's GPT-series models are not open source, but OpenAI provides a trial website where readers with access can try the models themselves.
1.2 Instruction learning and prompt learning

Instruction learning is an idea proposed by Quoc V. Le's team at Google in a 2021 article titled "Finetuned Language Models Are Zero-Shot Learners" [5]. Both instruction learning and prompt learning aim to tap into the knowledge inherent in a language model. The difference is that a prompt stimulates the model's completion ability, e.g., generating the second half of a sentence from the first half, or filling in a cloze blank, whereas an instruction stimulates the model's understanding ability by giving it a more explicit directive so that it takes the correct action.
We can understand these two learning methods through the following examples.

Prompt learning: "I bought this necklace for my girlfriend, and she really likes it, but it is too ____."
Instruction learning: "Determine the sentiment of this sentence: I bought this necklace for my girlfriend, and she really likes it."

The advantage of instruction learning is that after fine-tuning on multiple tasks, the model can also do zero-shot inference on other tasks, whereas prompt learning is aimed at a single task and its generalization ability is not as good as instruction learning's. We can understand the similarities and differences among model fine-tuning, prompt learning, and instruction learning through Figure 2.
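The two paradigms above differ mainly in how the input is framed. A minimal sketch in Python (the template strings are my own illustration, not OpenAI's actual formats):

```python
def make_prompt_input(sentence: str) -> str:
    # Prompt learning: frame the task as text completion and let the
    # language model fill in the blank.
    return sentence + " In short, this necklace is very ____."

def make_instruction_input(sentence: str) -> str:
    # Instruction learning: state the task explicitly so the model
    # understands which action to take.
    return 'Determine the sentiment of this sentence: "' + sentence + '"'

sentence = "I bought this necklace for my girlfriend, and she really likes it."
print(make_prompt_input(sentence))
print(make_instruction_input(sentence))
```

The same underlying sentence is used in both cases; only the framing tells the model whether to complete text or to follow a directive.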
Figure 2: Similarities and differences among model fine-tuning, prompt learning, and instruction learning.

1.3 Reinforcement learning from human feedback

A trained model is not very controllable: it can be seen as a fit to the distribution of its training set, so the distribution of the training data fed into the generative model is the most important factor affecting the quality of the generated content. Sometimes we hope the model is not only influenced by the training data but is also humanly controllable, so as to ensure that the generated content is helpful, truthful, and harmless. The paper repeatedly mentions the alignment problem, which can be understood as aligning the model's output with the output humans would like. This covers not only the fluency and grammatical correctness of the generated content but also its helpfulness, truthfulness, and harmlessness.
We know that reinforcement learning guides model training through a reward mechanism. The reward can be seen as the counterpart of the loss function in traditional model training, but its computation is more flexible and diverse (for AlphaGo, the reward is the outcome of the game). The cost of this flexibility is that the reward is non-differentiable and therefore cannot be used directly for backpropagation. The idea of reinforcement learning is to fit the loss function by sampling rewards in large quantities, thereby enabling model training. Likewise, human feedback is non-differentiable, so we can also use it as a reward for reinforcement learning; thus reinforcement learning from human feedback emerged at the right moment. RLHF can be traced back to Google's "Deep Reinforcement Learning from Human Preferences" published in 2017 [6], which used human annotations as feedback to improve the performance of reinforcement learning on simulated robotics and Atari games.
Figure 3: Basic principle of reinforcement learning from human feedback.

In InstructGPT/ChatGPT, reinforcement learning also uses a classic algorithm proposed by OpenAI: Proximal Policy Optimization (PPO) [7]. PPO is a new type of policy gradient algorithm. Vanilla policy gradient methods are very sensitive to the step size, yet a suitable step size is hard to choose: if the new and old policies differ too much during training, learning suffers. PPO proposes a new objective function that allows small-batch updates over multiple training steps, solving the problem of choosing the step size in policy gradient algorithms.
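The way PPO limits the step size can be written down compactly: the probability ratio between new and old policies is clipped so a single update cannot move the policy too far. A minimal pure-Python sketch of the clipped surrogate objective for one action (the epsilon value is illustrative):

```python
import math

def ppo_clip_objective(logp_new: float, logp_old: float,
                       advantage: float, eps: float = 0.2) -> float:
    """Clipped surrogate objective of PPO [7]:
    L = min(r * A, clip(r, 1 - eps, 1 + eps) * A),
    where r = pi_new(a|s) / pi_old(a|s) and A is the advantage."""
    ratio = math.exp(logp_new - logp_old)
    clipped = max(min(ratio, 1.0 + eps), 1.0 - eps)
    return min(ratio * advantage, clipped * advantage)

# When the new policy barely changes, the ratio is ~1 and clipping is inactive.
print(ppo_clip_objective(math.log(0.5), math.log(0.5), advantage=1.0))  # 1.0

# A large policy jump (ratio = 2) with positive advantage is clipped to 1.2,
# so the gradient gives no incentive to move further than the clip range.
print(ppo_clip_objective(math.log(0.8), math.log(0.4), advantage=1.0))  # 1.2
```

The clipping is what makes the step size "self-limiting": once the policy ratio leaves the interval [1 - eps, 1 + eps], increasing it further no longer increases the objective.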
In fact, TRPO also aims to solve this problem, but compared with TRPO, PPO is easier to solve.

2. How InstructGPT/ChatGPT works

With the above background knowledge, it becomes much easier to understand the principles of InstructGPT/ChatGPT.
Simply put, InstructGPT and ChatGPT both adopt the GPT-3 network structure. They construct training samples through instruction learning to train a reward model (RM) that predicts the quality of the generated content, and then use the RM's scores to guide the reinforcement-learning training of the generative model. The training process of InstructGPT/ChatGPT is shown in Figure 4.
Figure 4: Computation flow of InstructGPT: (1) supervised fine-tuning (SFT); (2) reward model (RM) training; (3) reinforcement learning via PPO according to the reward model.

From Figure 4, the training of InstructGPT/ChatGPT can be divided into three steps, where the reward model of step 2 and the reinforcement-learned SFT model of step 3 can be optimized iteratively:

1. Perform supervised fine-tuning (SFT) of GPT-3 on the collected SFT dataset.
2. Collect human-annotated comparison data and train a reward model (RM).
3. Use the RM as the optimization objective of reinforcement learning and fine-tune the SFT model with the PPO algorithm.
Following Figure 4, we introduce the dataset collection and model training of InstructGPT/ChatGPT.

2.1 Dataset collection

As shown in Figure 4, the training of InstructGPT/ChatGPT is divided into three steps, and the data required by each step differs slightly; we introduce them below.
2.1.1 SFT dataset

The SFT dataset is used to train the supervised model of step 1, i.e., to fine-tune GPT-3 on newly collected data following GPT-3's own training method. Since GPT-3 is a prompt-based generative model, the SFT dataset consists of prompt-response pairs.
Part of the SFT data comes from users of OpenAI's Playground; another part comes from 40 labelers hired by OpenAI, who were trained for the annotation work on this dataset. The labelers' job was to write instructions themselves, and the instructions were required to satisfy the following three points:

- Plain tasks: the labeler writes an arbitrary simple task, while ensuring task diversity.
- Few-shot tasks: the labeler writes an instruction together with multiple query-response pairs for that instruction.
- User-related: the labeler writes instructions based on use cases obtained from the API.
2.1.2 RM dataset

The RM dataset is used to train the reward model of step 2. The training of InstructGPT/ChatGPT needs a reward objective, one that aligns as comprehensively and truthfully as possible with the content we want the model to generate. Naturally, we can provide this reward through human annotation: give low scores to generated content that involves bias, so as to encourage the model not to generate content humans dislike. The approach of InstructGPT/ChatGPT is to first have the model generate a batch of candidate texts and then have labelers rank them by the quality of the generated content.

2.1.3 PPO dataset

The PPO data of InstructGPT is not annotated; it all comes from users of the GPT-3 API. It covers different task types provided by different users, among which generation tasks account for the highest proportion (45.6%), followed by QA (12.4%), brainstorming (11.2%), dialogue (8.4%), and so on.

2.1.4 Data analysis

Because InstructGPT/ChatGPT is fine-tuned on top of GPT-3 and human annotation is involved, the total amount of data is not large. Table 2 shows the sources and sizes of the three datasets.
Table 2: Data distribution of InstructGPT.

The data are discussed in more detail in Appendix A of the paper. Here I list a few factors that may affect the model's performance:

- Over 96% of the data is in English; the other 20 languages, such as Chinese, French, and Spanish, together add up to less than 4%. This means InstructGPT/ChatGPT can generate other languages, but the effect should be far worse than in English.
- There are nine types of prompts in total, and the vast majority are generation tasks, which may leave some task types uncovered.
- The 40 outsourced labelers come from the US and Southeast Asia; they are few and their distribution is relatively concentrated. The goal of InstructGPT/ChatGPT is to train a pre-trained model with correct values, but those values are a combination of the values of these 40 outsourced labelers, and this relatively narrow distribution may raise concerns about discrimination and bias in other regions.

In addition, ChatGPT's blog mentions that ChatGPT and InstructGPT share the same training method and differ only in data collection, but no further details about the differences in data collection have been disclosed. Considering that ChatGPT is used only in the dialogue domain, I suspect ChatGPT differs in two ways in data collection: 1. it increased the proportion of dialogue tasks; 2. it converted the prompts into a Q&A format. Of course, this is merely a guess; a more accurate description will have to wait until more detailed material such as ChatGPT's paper and source code is published.
2.2 Training tasks

As just introduced, InstructGPT/ChatGPT has a three-step training scheme involving three models: SFT, RM, and PPO. We introduce them in detail below.

2.2.1 Supervised fine-tuning (SFT)

The training of this step is the same as for GPT-3, and moreover the authors found that letting the model overfit slightly helps the two subsequent training steps.

2.2.2 Reward model (RM)

Because the RM's training data takes the form of rankings that labelers assign to generated results, the RM can be regarded as a regression model. Its structure is the model obtained after SFT training with the final unembedding layer removed; its input is a prompt and a response, and its output is a scalar reward value.
Concretely, for each prompt, InstructGPT/ChatGPT randomly generates $K$ outputs ($4 \leq K \leq 9$) and shows them to each labeler in pairs, i.e., $C_K^2$ comparison pairs per prompt, from which the labeler selects the better output. During training, InstructGPT/ChatGPT treats the $C_K^2$ response pairs of each prompt as one batch. This per-prompt batching is less prone to overfitting than the traditional per-sample batching, because under this scheme each prompt is fed into the model exactly once.
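The pairing scheme is straightforward to sketch: for each prompt, the $K$ sampled responses yield $C_K^2$ (preferred, rejected) comparison pairs, and all pairs for one prompt form a single batch. The field names below are my own illustration:

```python
from itertools import combinations

def build_comparison_batch(prompt, ranked_responses):
    """Given responses already ranked best-to-worst by labelers, emit all
    C(K, 2) (chosen, rejected) pairs as one per-prompt batch."""
    pairs = []
    for i, j in combinations(range(len(ranked_responses)), 2):
        # i < j, so response i is ranked higher (preferred) than response j.
        pairs.append({"prompt": prompt,
                      "chosen": ranked_responses[i],
                      "rejected": ranked_responses[j]})
    return pairs

batch = build_comparison_batch("Explain RLHF.", ["A", "B", "C", "D"])
print(len(batch))  # K = 4 responses -> C(4, 2) = 6 pairs
```

Grouping all six pairs of one prompt into one batch is exactly what lets each prompt pass through the model only once per update.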
The loss function of the reward model is given by Eq. (1). Its objective is to maximize the gap between the rewards of the responses the labelers prefer and the responses they dislike:

$$\operatorname{loss}(\theta)=-\frac{1}{\binom{K}{2}} E_{(x, y_w, y_l) \sim D}\left[\log \left(\sigma\left(r_\theta\left(x, y_w\right)-r_\theta\left(x, y_l\right)\right)\right)\right] \tag{1}$$
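Equation (1) is just a pairwise logistic loss on reward differences. A minimal pure-Python sketch, with the reward model stubbed out as a plain function `r` returning a scalar (the stub is my own toy, not the real RM):

```python
import math
from itertools import combinations

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def rm_pairwise_loss(r, prompt, ranked_responses) -> float:
    """Loss of Eq. (1) for one prompt: the average of
    -log(sigmoid(r(x, y_w) - r(x, y_l))) over all C(K, 2) pairs,
    where y_w is the higher-ranked response of each pair."""
    K = len(ranked_responses)
    total = 0.0
    for i, j in combinations(range(K), 2):
        y_w, y_l = ranked_responses[i], ranked_responses[j]  # i ranked higher
        total += -math.log(sigmoid(r(prompt, y_w) - r(prompt, y_l)))
    return total / (K * (K - 1) / 2)

# Toy reward model: longer responses score higher, agreeing with the ranking,
# so the loss is small; a model that inverted the ranking would score higher loss.
r = lambda x, y: float(len(y))
loss = rm_pairwise_loss(r, "prompt", ["aaa", "aa", "a"])
print(round(loss, 4))
```

Minimizing this loss pushes the reward of each preferred response above that of its rejected counterpart, which is all the RM needs to do to serve as an RL objective later.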
Here $r_\theta(x, y)$ is the reward of prompt $x$ and response $y$ under the reward model with parameters $\theta$, $y_w$ is the response the labeler prefers, $y_l$ is the response the labeler dislikes, and $D$ is the whole training dataset.

2.2.3 Reinforcement learning model (PPO)

Reinforcement learning and pre-trained models are two of the hottest AI directions of the past two years. Previously, quite a few researchers argued that reinforcement learning is not well suited to pre-trained models, because it is hard to build a reward mechanism from a model's output content.
InstructGPT/ChatGPT achieved this counterintuitively: by incorporating human annotation, it introduces reinforcement learning into pre-trained language models, which is the biggest innovation of this algorithm. As shown in Table 2, PPO's training set comes entirely from the API; the reward model obtained in step 2 guides the continued training of the SFT model. Reinforcement learning is often very hard to train, and InstructGPT/ChatGPT ran into two problems during training.

Problem 1: as the model is updated, the data produced by the reinforcement-learning model diverges more and more from the data used to train the reward model. The authors' solution is to add a KL penalty term to the loss function,
$$\beta \log \left(\pi_\phi^{\mathrm{RL}}(y \mid x) / \pi^{\mathrm{SFT}}(y \mid x)\right)$$
to ensure that the PPO model's output does not drift far from the SFT model's output.

Problem 2: training with only the PPO objective causes a large drop in the model's performance on general NLP tasks. The authors' solution is to add a general language-modeling objective to the training target: $\gamma E_{x \sim D_{\text{pretrain}}}\left[\log \left(\pi_\phi^{\mathrm{RL}}(x)\right)\right]$.
This term is called PPO-ptx in the paper. In summary, PPO's training objective is Eq. (2):

$$\text{objective}(\phi)=E_{(x, y) \sim D_{\pi_\phi^{\mathrm{RL}}}}\left[r_\theta(x, y)-\beta \log \left(\pi_\phi^{\mathrm{RL}}(y \mid x) / \pi^{\mathrm{SFT}}(y \mid x)\right)\right]+\gamma E_{x \sim D_{\text{pretrain}}}\left[\log \left(\pi_\phi^{\mathrm{RL}}(x)\right)\right] \tag{2}$$
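Term by term, Eq. (2) combines the RM score, a KL penalty toward the SFT policy, and a pretraining language-modeling bonus. A numeric sketch of a single-sample estimate of the objective, built from log-probabilities (the β and γ values are illustrative, not the paper's):

```python
def ppo_ptx_objective(rm_score: float,
                      logp_rl: float, logp_sft: float,
                      logp_pretrain: float,
                      beta: float = 0.02, gamma: float = 0.1) -> float:
    """Per-sample estimate of Eq. (2):
    r_theta(x, y) - beta * log(pi_RL(y|x) / pi_SFT(y|x))
    + gamma * log(pi_RL(x)) on a pretraining sample."""
    kl_penalty = beta * (logp_rl - logp_sft)  # log of the policy ratio
    ptx_bonus = gamma * logp_pretrain         # the PPO-ptx term
    return rm_score - kl_penalty + ptx_bonus

# If the RL policy matches SFT exactly, the KL penalty vanishes and only
# the RM score and the pretraining bonus remain.
obj = ppo_ptx_objective(rm_score=1.0, logp_rl=-5.0, logp_sft=-5.0,
                        logp_pretrain=-10.0)
print(obj)  # 1.0 - 0 + 0.1 * (-10.0) = 0.0
```

If the RL policy assigns the response a higher probability than SFT does (logp_rl > logp_sft), the penalty term reduces the objective, which is exactly the mechanism keeping the PPO model close to the SFT model.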
3. Performance analysis of InstructGPT/ChatGPT

It is undeniable that InstructGPT/ChatGPT works remarkably well, especially after the introduction of human annotation, which greatly improved both the correctness of the model's "values" and the truthfulness of its alignment with human behavior patterns. So, based only on the technical scheme and training method of InstructGPT/ChatGPT, what performance improvements can we expect it to bring?

3.1 Advantages

InstructGPT/ChatGPT is more truthful than GPT-3. This is easy to understand: GPT-3 already has very strong generalization and generation abilities, and on top of that InstructGPT/ChatGPT introduces different labelers to write prompts and rank generated results, and is fine-tuned on top of GPT-3; all of this gives more truthful data higher rewards when the reward model is trained.
The authors also compared these models with GPT-3 on the TruthfulQA dataset; the experiments show that even the small 1.3B-parameter PPO-ptx model outperforms GPT-3.

InstructGPT/ChatGPT is slightly better than GPT-3 in harmlessness, for the same reason as above. However, the authors found that InstructGPT shows no obvious improvement on discrimination and bias datasets. This is because GPT-3 itself is already a very good model, so its probability of generating problematic samples containing harm, discrimination, or bias is already low; data collected and annotated by only 40 labelers is probably insufficient to fully optimize the model in these respects, so the improvement is small or imperceptible.
InstructGPT/ChatGPT has strong coding ability. First, GPT-3 itself already codes well, and APIs built on GPT-3 have accumulated a large amount of code; moreover, some OpenAI employees took part in the data collection. With a large amount of coding-related data plus human annotation, it is no surprise that the resulting InstructGPT/ChatGPT codes very well.

3.2 Disadvantages

InstructGPT/ChatGPT reduces the model's performance on general NLP tasks.
We discussed this point when covering PPO training: modifying the loss function can mitigate the problem, but it is not completely solved.

Sometimes InstructGPT/ChatGPT gives absurd outputs. Although it uses human feedback, the available human resources are limited. The largest influence on model performance is still the supervised language-modeling task; humans merely play a correcting role. So it is quite possible that the limited amount of correction data, or the misleading nature of the supervised objective (which considers only the model's output, not what humans actually want), causes it to generate untruthful content. It is like a student: even with a teacher's guidance, there is no certainty the student will master every point.

The model is very sensitive to instructions. This can also be attributed to an insufficient amount of labeler-annotated data: instructions are the model's only clue for producing output, and if the quantity and variety of trained instructions are insufficient, the model may exhibit this problem.

The model over-interprets simple concepts. This may be because labelers tend to give longer generated outputs higher rewards when comparing them.
The model may give harmful replies to harmful instructions: for example, InstructGPT/ChatGPT will produce an action plan in response to a user's request for an "AI plan to destroy humanity" (Figure 5). This happens because InstructGPT/ChatGPT assumes the instructions written by labelers are reasonable and have correct values, and makes no finer judgment about instructions given by users, so the model will answer any input. Although the reward model may later give such outputs low reward values, when generating text the model must consider not only its values but also how well the generated content matches the instruction, so it is still possible for it to generate outputs with problematic values.

Figure 5: A plan for destroying humanity written by ChatGPT.

3.3 Future work

We have analyzed the technical scheme of InstructGPT/ChatGPT and its problems, so we can also see some angles for optimizing InstructGPT/ChatGPT.
Cheaper and more effective human annotation. InstructGPT/ChatGPT hired a 40-person annotation team, but judging from the model's performance this team is not enough. How to let humans provide more effective feedback, and how to combine human performance and model performance organically and skillfully, is very important.

The model's ability to generalize over and correct instructions. Since instructions are the model's only clue for producing output, the model depends on them heavily; improving the model's ability to generalize over instructions and to correct erroneous instructions is a very important piece of work for improving the user experience. It would not only give the model broader application scenarios but also make it more "intelligent".

Avoiding the performance drop on general tasks. This may require designing a more sensible way to use human feedback, or a more advanced model structure. As discussed, many of InstructGPT/ChatGPT's problems can be solved by providing more labeler-annotated data, but that would cause a more severe performance drop on general NLP tasks, so a scheme is needed to balance the 3H quality of generated results against performance on general NLP tasks.
3.4 Answers to hot questions about InstructGPT/ChatGPT

Will the appearance of ChatGPT put junior programmers out of work? Judging from ChatGPT's principle and the generated content leaked online, much of the code ChatGPT produces runs correctly. But a programmer's job is not just to write code; more important is to find solutions to problems. So ChatGPT will not replace programmers, especially senior programmers. On the contrary, like many of today's code-generation tools, it will become a very useful tool for programmers writing code.

Stack Overflow announces a temporary rule: ChatGPT is banned. ChatGPT is essentially a text-generation model; compared with generating code, it is even better at generating plausible-looking text. Code or solutions produced by a text-generation model are not guaranteed to run or to solve the problem, yet their plausible text will mislead many people searching for answers to those questions. To maintain the quality of the forum, Stack Overflow's ban on ChatGPT is quite understandable.
The chatbot ChatGPT, under prompting, wrote a "plan to destroy humanity" and even gave code. What issues in AI development deserve attention? ChatGPT's "plan to destroy humanity" is content it force-fitted from massive data under unforeseen instructions. Although the content looks realistic and is fluently expressed, it shows only that ChatGPT has very strong generation ability; it does not mean ChatGPT harbors any intention of destroying humanity. It is merely a text-generation model, not a decision-making model.

4. Summary

Like many algorithms when they first become popular, ChatGPT has drawn wide industry attention, and prompted human reflection on AI, thanks to its helpfulness, truthfulness, and harmlessness. But after examining its algorithmic principle, we find it is not as terrifying as the industry hype suggests; on the contrary, we can learn many valuable things from its technical scheme. InstructGPT/ChatGPT's most important contribution to the AI field is the clever combination of reinforcement learning and pre-trained models, together with the use of human feedback to improve the model's helpfulness, truthfulness, and harmlessness. ChatGPT also further raises the cost of large models: the competition used to be only about data volume and model scale, but now it even includes the expense of hired outsourced annotation, making it even more prohibitive for individual researchers.
References

[1] Ouyang, Long, et al. "Training language models to follow instructions with human feedback." arXiv preprint arXiv:2203.02155 (2022).
https://arxiv.org/pdf/2203.02155.pdf
[2] Radford, A., Narasimhan, K., Salimans, T. and Sutskever, I., 2018. Improving language understanding by generative pre-training.
https://www.cs.ubc.ca/~amuham01/LING530/papers/radford2018improving.pdf
[3] Radford, A., Wu, J., Child, R., Luan, D., Amodei, D. and Sutskever, I., 2019. Language models are unsupervised multitask learners. OpenAI blog, 1(8), p.9.
https://life-extension.github.io/2020/05/27/GPT%E6%8A%80%E6%9C%AF%E5%88%9D%E6%8E%A2/language-models.pdf
[4] Brown, Tom B., et al. "Language models are few-shot learners." arXiv preprint arXiv:2005.14165 (2020).
https://proceedings.neurips.cc/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf
[5] Wei, Jason, et al. "Finetuned language models are zero-shot learners." arXiv preprint arXiv:2109.01652 (2021).
https://arxiv.org/pdf/2109.01652.pdf
[6] Christiano, Paul F., et al. "Deep reinforcement learning from human preferences." Advances in Neural Information Processing Systems 30 (2017).
https://arxiv.org/pdf/1706.03741.pdf
[7] Schulman, John, et al. "Proximal policy optimization algorithms." arXiv preprint arXiv:1707.06347 (2017).
https://arxiv.org/pdf/1707.06347.pdf