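[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"$fDg8N6CvyvcV5DytAw1mcoGoVtDATs6m5T9s-OzfMPeg":3},{"answer":4,"createTime":5,"id":6,"options":7,"origin":12,"question":15,"related":16,"source":20,"type":21},[],"2024-11-25 08:25:44",999757103,[8,9,10,11],"\u003Cimg src=\"https:\u002F\u002Ftihai-oss-cloud.itihey.com\u002Fimg\u002F2805907a1e7b9a0547b332877297e4ae.png\">","\u003Cimg src=\"https:\u002F\u002Ftihai-oss-cloud.itihey.com\u002Fimg\u002F8a95a2284af97c60bbac298893a22bf8.png\">","\u003Cimg src=\"https:\u002F\u002Ftihai-oss-cloud.itihey.com\u002Fimg\u002F4ba696f660ebea22ad4fedd4feffe342.png\">","\u003Cimg src=\"https:\u002F\u002Ftihai-oss-cloud.itihey.com\u002Fimg\u002Fe00b40e10d3a0eef658f14bc64d5a6c0.png\">",{"courseImg":13,"courseName":14},"https:\u002F\u002Ftihai-oss-cloud.itihey.com\u002Fimg\u002Fcf3bb414b5ea2367f316b2d3561124c7.jpg","[Shared Course] Artificial Intelligence","A generalization of Q-learning. Suppose an MDP has state space S, action space A, reward function R(s, a, s'), and discount factor \u003Cimg src=\"https:\u002F\u002Ftihai-oss-cloud.itihey.com\u002Fimg\u002F6494510fd9def2a2b5ff2ece65f0aa59.png\">. Our ultimate goal is to learn a policy that a robot can use in the real world. However, we only have access to data from simulation software, not from the real robot. The simulator is built on the transition model \u003Cimg src=\"https:\u002F\u002Ftihai-oss-cloud.itihey.com\u002Fimg\u002F11957ae58492f97996dbe380ad9ef63e.png\">, which differs from the real robot's transition model \u003Cimg src=\"https:\u002F\u002Ftihai-oss-cloud.itihey.com\u002Fimg\u002Fd445a3eb7e7399e60d0b14c9be70fd97.png\">. Without changing the simulator, we want to use samples drawn from it to learn the Q-values of the real robot. The Q-learning update rule can be written as: \u003Cimg src=\"https:\u002F\u002Ftihai-oss-cloud.itihey.com\u002Fimg\u002F064d3e5c1d221745132312e5fea65740.png\"> Assuming the samples are drawn from the simulator, the Q-value update rule that learns the real-world Q-values is: ( )",[17,22,32,40,49,54,63,72,81,90],{"answer":18,"createTime":5,"id":6,"options":19,"question":15,"source":20,"type":21},[],[8,9,10,11],"v2",0,{"answer":23,"createTime":5,"id":24,"options":25,"question":30,"source":20,"type":31},[],999757116,[26,27,28,29],"BFS","DFS","UCS","None","If a search tree has finite height and all step costs are non-negative, then after multiplying the cost of every edge by a positive constant w&gt;0, the search path returned by ( ) among the following tree search algorithms remains unchanged",1,{"answer":33,"createTime":5,"id":34,"options":35,"question":38,"source":20,"type":39},[],999757128,[36,37],"True","False","Model-based reinforcement learning involves purely offline computation, whereas model-free reinforcement learning requires online interaction with the environment. ( )",3,{"answer":41,"createTime":5,"id":42,"options":43,"question":48,"source":20,"type":31},[],999757155,[44,45,46,47],"h(x) is the estimated cost of the optimal path from node x to the goal node","h(x) is the actual cost from node x to the goal node","g(x) is the actual cost from the initial node to node x","g(x) is the estimated cost of the optimal path from the initial node to node x","In the evaluation function, which of the following descriptions of g(x) and h(x) are correct? ( )",{"answer":50,"createTime":5,"id":51,"options":52,"question":53,"source":20,"type":39},[],999757175,[36,37],"Greedy search is guaranteed to find the optimal solution, because it always generates and expands nodes in the direction that moves closer to the goal state. ( )",{"answer":55,"createTime":5,"id":56,"options":57,"question":62,"source":20,"type":31},[],999757195,[58,59,60,61],"Breadth-first search expands nodes in the order they were generated (first generated, first expanded)","Depth-first search expands nodes in the order they were generated (first generated, first expanded)","Depth-first search expands the most recently generated node first","Breadth-first search expands the most recently generated node first","How do breadth-first search and depth-first search differ? ( )",{"answer":64,"createTime":5,"id":65,"options":66,"question":71,"source":20,"type":21},[],999757205,[67,68,69,70],"lowest cost","smallest depth","greatest depth","highest cost","In uniform-cost search, the node with the ( ) is always selected for expansion",{"answer":73,"createTime":5,"id":74,"options":75,"question":80,"source":20,"type":31},[],999757207,[76,77,78,79],"Value iteration started from random initial values converges to \u003Cimg src=\"https:\u002F\u002Ftihai-oss-cloud.itihey.com\u002Fimg\u002Feb83dde30e38dd8e2deec363d6351a90.png\">, where \u003Cimg src=\"https:\u002F\u002Ftihai-oss-cloud.itihey.com\u002Fimg\u002Fe82100bc61ebe5e50da86f6d69771a87.png\"> is the optimal policy","Q-learning uses an approximation of the optimal action-value function as its learning target, independent of the behavior policy, and is therefore off-policy","With a deterministic transition model, Q-learning can converge to the optimal policy without exploration","In an MDP, a large discount factor (close to 1) means the agent places more weight on long-term returns","Which of the following statements about MDPs and RL are correct? ( )",{"answer":82,"createTime":5,"id":83,"options":84,"question":89,"source":20,"type":21},[],999757213,[85,86,87,88],"The action paths leading to goal states all have the same cost","Constraint satisfaction problems have an optimal solution","During search, backtracking occurs because some conflict prevents the search from continuing","Forward checking is a method that removes inconsistent values in advance","Regarding constraint satisfaction problems, which of the following statements is incorrect? ( )",{"answer":91,"createTime":5,"id":92,"options":93,"question":94,"source":20,"type":39},[],999757224,[36,37],"A negative living reward can always be expressed using a discount factor less than 1. ( )"]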