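[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"$fWS2yWqInw_MN4pVn9uCBKHRL3PcqkBi7uuIDaxPviu8":3},{"answer":4,"createTime":5,"id":6,"options":7,"origin":10,"question":13,"related":14,"source":24,"type":41},[],"2024-11-25 08:25:44",999757224,[8,9],"True","False",{"courseImg":11,"courseName":12},"https:\u002F\u002Ftihai-oss-cloud.itihey.com\u002Fimg\u002Fcf3bb414b5ea2367f316b2d3561124c7.jpg","[Shared Course] Artificial Intelligence","A negative living reward can always be represented by a discount factor less than 1. ( )",[15,26,36,42,51,56,65,74,83,92],{"answer":16,"createTime":5,"id":17,"options":18,"question":23,"source":24,"type":25},[],999757103,[19,20,21,22],"\u003Cimg src=\"https:\u002F\u002Ftihai-oss-cloud.itihey.com\u002Fimg\u002F2805907a1e7b9a0547b332877297e4ae.png\">","\u003Cimg src=\"https:\u002F\u002Ftihai-oss-cloud.itihey.com\u002Fimg\u002F8a95a2284af97c60bbac298893a22bf8.png\">","\u003Cimg src=\"https:\u002F\u002Ftihai-oss-cloud.itihey.com\u002Fimg\u002F4ba696f660ebea22ad4fedd4feffe342.png\">","\u003Cimg src=\"https:\u002F\u002Ftihai-oss-cloud.itihey.com\u002Fimg\u002Fe00b40e10d3a0eef658f14bc64d5a6c0.png\">","A generalization of Q-learning. Consider an MDP with state space S, action space A, reward function R(s, a, s'), and discount factor \u003Cimg src=\"https:\u002F\u002Ftihai-oss-cloud.itihey.com\u002Fimg\u002F6494510fd9def2a2b5ff2ece65f0aa59.png\">. Our ultimate goal is to learn a policy that a robot can use in the real world. However, we only have access to data from simulation software, not from the real robot. The simulator is built on the transition model \u003Cimg src=\"https:\u002F\u002Ftihai-oss-cloud.itihey.com\u002Fimg\u002F11957ae58492f97996dbe380ad9ef63e.png\">, which differs from the real robot's transition model \u003Cimg src=\"https:\u002F\u002Ftihai-oss-cloud.itihey.com\u002Fimg\u002Fd445a3eb7e7399e60d0b14c9be70fd97.png\">. Without changing the simulator, we want to use samples drawn from it to learn the Q-values of the real robot. The Q-learning update can be written as: \u003Cimg src=\"https:\u002F\u002Ftihai-oss-cloud.itihey.com\u002Fimg\u002F064d3e5c1d221745132312e5fea65740.png\"> Assuming the samples are drawn from the simulator, the Q-value update that learns the real-world Q-values is: ( )","v2",0,{"answer":27,"createTime":5,"id":28,"options":29,"question":34,"source":24,"type":35},[],999757116,[30,31,32,33],"BFS","DFS","UCS","None","Suppose a search tree has finite height and all step costs are non-negative. If the cost of every edge is multiplied by a positive constant w&gt;0, the search path returned by ( ) among the following tree search algorithms remains unchanged",1,{"answer":37,"createTime":5,"id":38,"options":39,"question":40,"source":24,"type":41},[],999757128,[8,9],"Model-based reinforcement learning involves purely offline computation, whereas model-free reinforcement learning requires online interaction with the environment. ( )",3,{"answer":43,"createTime":5,"id":44,"options":45,"question":50,"source":24,"type":35},[],999757155,[46,47,48,49],"h(x) is the estimated cost of the optimal path from node x to the goal node","h(x) is the actual cost from node x to the goal node","g(x) is the actual cost from the initial node to node x","g(x) is the estimated cost of the optimal path from the initial node to node x","For g(x) and h(x) in the evaluation function, which of the following descriptions are correct? ( )",{"answer":52,"createTime":5,"id":53,"options":54,"question":55,"source":24,"type":41},[],999757175,[8,9],"Greedy search is guaranteed to find an optimal solution, because it always generates and expands nodes in the direction that moves closer to the goal state. ( )",{"answer":57,"createTime":5,"id":58,"options":59,"question":64,"source":24,"type":35},[],999757195,[60,61,62,63],"Breadth-first search expands the earliest-generated nodes first","Depth-first search expands the earliest-generated nodes first","Depth-first search expands the most recently generated node first","Breadth-first search expands the most recently generated node first","What are the differences between breadth-first search and depth-first search? ( )",{"answer":66,"createTime":5,"id":67,"options":68,"question":73,"source":24,"type":25},[],999757205,[69,70,71,72],"lowest cost","smallest depth","greatest depth","highest cost","Uniform-cost search always selects the node with the ( ) for expansion",{"answer":75,"createTime":5,"id":76,"options":77,"question":82,"source":24,"type":35},[],999757207,[78,79,80,81],"Value iteration started from random initial values converges to \u003Cimg src=\"https:\u002F\u002Ftihai-oss-cloud.itihey.com\u002Fimg\u002Feb83dde30e38dd8e2deec363d6351a90.png\">, where \u003Cimg src=\"https:\u002F\u002Ftihai-oss-cloud.itihey.com\u002Fimg\u002Fe82100bc61ebe5e50da86f6d69771a87.png\"> is the optimal policy","Q-learning uses an approximation of the optimal action-value function as its learning target, independent of the behavior policy, so it is off-policy","With a deterministic transition model, Q-learning can converge to the optimal policy without exploration","In an MDP, a larger discount factor (close to 1) means the agent places more weight on long-term return","Which of the following statements about MDPs and RL are correct? ( )",{"answer":84,"createTime":5,"id":85,"options":86,"question":91,"source":24,"type":25},[],999757213,[87,88,89,90],"The action paths leading to goal states all have the same cost","A constraint satisfaction problem has an optimal solution","During search, backtracking occurs because some conflict prevents the search from continuing","Forward checking is a method that removes inconsistent values in advance","Which of the following statements about constraint satisfaction problems is incorrect? ( )",{"answer":93,"createTime":5,"id":6,"options":94,"question":13,"source":24,"type":41},[],[8,9]]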