Question: Karpathy Pong cross-entropy/log loss explanation for y - aprob

Stack Overflow user

Asked 2019-02-19 04:56:17

Answers: 1 | Views: 337 | Followers: 0 | Votes: 1

I am trying to understand the Python Pong code explained here: Karpathy Pong

# forward the policy network and sample an action from the returned probability
  #########action 2 is up and 3 is down
  aprob, h = policy_forward(x)
  print("aprob\n {}\n h\n {}\n".format(aprob, h))
  #2 is up, 3 is down
  action = 2 if np.random.uniform() < aprob else 3 # roll the dice!
  print("action\n {}\n".format(action))
  # record various intermediates (needed later for backprop)
  xs.append(x) # observation, ie. the difference frame?
  #print("xs {}".format(xs))
  hs.append(h) # hidden state obtained from forward pass
  #print("hs {}".format(hs)) 
  #if action is up, y = 1, else 0
  y = 1 if action == 2 else 0 # a "fake label"
  print("y \n{}\n".format(y))
  dlogps.append(y - aprob) # grad that encourages the action that was taken to be taken (see http://cs231n.github.io/neural-networks-2/#losses if confused)
  print("dlogps\n {}\n".format(dlogps))
  # step the environment and get new measurements
  observation, reward, done, info = env.step(action)
  print("observation\n {}\n reward\n {}\n done\n {}\n ".format(observation, reward, done))
  reward_sum += reward
  print("reward_sum\n {}\n".format(reward_sum))
  drs.append(reward) # record reward (has to be done after we call step() to get reward for previous action)
  print("drs\n {}\n".format(drs))
  if done: # an episode finished
    episode_number += 1

In the snippet above, I don't quite understand why a fake label is needed, or what this line means:

dlogps.append(y - aprob)# grad that encourages the action that was taken to be taken (see http://cs231n.github.io/neural-networks-2/#losses if confused)

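Why is it the fake label y minus aprob?

My understanding is that the network outputs a "log probability" of moving up, but the explanation seems to suggest that the label should really be the reward obtained for taking that action, which then encourages all the actions in an episode if it was a winning one. So I don't see how a fake label of 1 or 0 helps.

Also, there is no log operation in the forward pass function, so how is it a log probability?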

#forward pass, how is logp a logp without any log operation?????
def policy_forward(x):
  h = np.dot(model['W1'], x)
  h[h<0] = 0 # ReLU nonlinearity
  logp = np.dot(model['W2'], h)
  p = sigmoid(logp)
  #print("p\n {}\n and h\n {}\n".format(p, h))
  return p, h # return probability of taking action 2 (up), and hidden state

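Edit:

I used print statements to see what was going on under the hood, and found that since y = 0 for the down action, (y - aprob) is negative whenever the action is down. His formula for modulating the gradient with the advantage, epdlogp *= discounted_epr, still ends up indicating whether the move down was good (a negative number) or bad (a positive number).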

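For the up action the reverse holds when the formula is applied: a positive value after epdlogp *= discounted_epr means the action was good, and a negative value means it was bad.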
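The sign logic described in this edit can be reproduced with a small sketch. The function name `scaled_gradient` and the advantage values of +1.0 and -1.0 are illustrative assumptions, not part of the original code; only the expression (y - aprob) * advantage mirrors the snippet.

```python
def scaled_gradient(action, aprob, advantage):
    """Return (y - aprob) * advantage for one time step.

    A positive result pushes the probability of "up" higher,
    a negative result pushes it lower (so "down" becomes more likely).
    """
    y = 1 if action == 2 else 0  # fake label: 1 for up (2), 0 for down (3)
    return (y - aprob) * advantage

# Hypothetical cases: aprob = 0.5, advantage +1.0 (good) or -1.0 (bad).
for action, name in ((2, "up"), (3, "down")):
    for advantage in (1.0, -1.0):
        g = scaled_gradient(action, 0.5, advantage)
        print("action={:4s} advantage={:+.1f} -> gradient {:+.2f}".format(name, advantage, g))
```

In all four cases the sign of the result moves the probability of "up" in the direction that makes the rewarded action more likely and the punished action less likely.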

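So this seems like a rather neat way to implement it, but I still don't understand how the aprob returned from the forward pass is a log probability, since the console output looks like this: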

aprob
 0.5

action
 3

aprob
 0.5010495775824385

action
 2

aprob
 0.5023498477623756

action
 2

aprob
 0.5051575154468827

action
 2

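These look like probabilities between 0 and 1. So is using y - aprob as a "log probability" just an intuitive hack that developed over years of practice? If so, were these hacks discovered by trial and error?

Edit: Thanks to Tommy's great explanation, I knew to look in my Udacity deep learning course videos for a refresher on log probabilities and cross-entropy: continue=94&v=iREoPUrpXvE

This cheat sheet also helped: functions.html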

python

gradient

reinforcement-learning

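1 Answer

Stack Overflow user

Accepted answer

Posted 2019-04-03 21:08:56

My explanation of how he arrives at (y - aprob):

When he does the forward pass through his network, the last step is to apply the sigmoid S(x) to the output of the last neuron.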

S(x) = 1 / (1+e^-x)

and its gradient is

grad S(x) = S(x)(1-S(x))
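As a quick sanity check on this formula, the sketch below compares S(x)(1 - S(x)) with a central finite difference. The helper names `sigmoid_grad` and `numeric_derivative` are illustrative and do not appear in the original code. The last lines also show that the pre-sigmoid value can be recovered from the probability as log(p / (1 - p)), i.e. it is a log-odds value, while the sigmoid output itself is an ordinary probability.

```python
import math

def sigmoid(x):
    """Logistic function S(x) = 1 / (1 + e^-x)."""
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    """Analytic derivative S(x) * (1 - S(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

def numeric_derivative(f, x, eps=1e-6):
    """Central finite-difference approximation of f'(x)."""
    return (f(x + eps) - f(x - eps)) / (2.0 * eps)

for x in (-2.0, 0.0, 0.5, 3.0):
    assert abs(sigmoid_grad(x) - numeric_derivative(sigmoid, x)) < 1e-8

# The pre-sigmoid value is the log-odds of the probability.
x = 0.8
p = sigmoid(x)
assert abs(math.log(p / (1.0 - p)) - x) < 1e-12
```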

To make your action more or less likely, you have to compute the log of the probability of your "label".

L = log p(y|x)

To backpropagate this, you have to compute the gradient of your log-likelihood L.

grad L = grad log p(y|x)

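Since the sigmoid is applied to the output, p = S(z), where z is the pre-sigmoid output of the last neuron, for the label y = 1 (action up) you actually compute

grad L = grad log S(z)  
grad L = 1 / S(z) * S(z)(1-S(z))  
grad L = (1-S(z))  
**grad L = (1-p)**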
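The derivation above covers the case where the label is 1 (action up). For completeness, the same chain-rule step for the label 0 (action down) can be written out as follows, where z is used here as a name for the pre-sigmoid output and p = S(z):

```latex
\begin{aligned}
\frac{\partial}{\partial z}\log\bigl(1 - S(z)\bigr)
  &= \frac{1}{1 - S(z)} \cdot \bigl(-S(z)\,(1 - S(z))\bigr) \\
  &= -S(z) \\
  &= 0 - p
\end{aligned}
```

Both cases fit the single expression "label minus p", which is the value stored by dlogps.append(y - aprob).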

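This is really nothing other than the log loss / cross-entropy. More generally, for a label y that is either 0 or 1:

L = y log p + (1-y) log(1-p)  
grad L = y-p

Here L is the log-likelihood and the gradient is taken with respect to the pre-sigmoid output. The cross-entropy loss is -L, so its gradient is p-y; ascending L is the same as descending the cross-entropy.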
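The y - p result can be checked numerically. The sketch below treats L as the log-likelihood log p(y|x), so the cross-entropy loss is its negative and has gradient p - y. The function names and the test points are illustrative assumptions; z denotes the pre-sigmoid output.

```python
import math

def sigmoid(z):
    """Logistic function S(z) = 1 / (1 + e^-z)."""
    return 1.0 / (1.0 + math.exp(-z))

def log_likelihood(y, z):
    """Bernoulli log-likelihood log p(y|x) with p = sigmoid(z)."""
    p = sigmoid(z)
    return y * math.log(p) + (1 - y) * math.log(1.0 - p)

def grad_log_likelihood(y, z):
    """Analytic gradient of the log-likelihood with respect to z."""
    return y - sigmoid(z)

def numeric_grad(y, z, eps=1e-6):
    """Central finite-difference gradient of the log-likelihood in z."""
    return (log_likelihood(y, z + eps) - log_likelihood(y, z - eps)) / (2.0 * eps)

for y in (0, 1):
    for z in (-1.5, 0.0, 0.8):
        assert abs(grad_log_likelihood(y, z) - numeric_grad(y, z)) < 1e-7
```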

Since Andrej doesn't use a framework such as TensorFlow or PyTorch in his example, he does the backpropagation by hand there.
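To illustrate what doing the backpropagation by hand involves, here is a minimal sketch for a two-layer network of the same shape as policy_forward in the question (ReLU hidden layer, sigmoid output). It is not the original author's code: the function names, the tiny layer sizes, the random inputs, and the advantage value of -0.7 are all assumptions made for the example. The hand-written gradient starts from advantage * (y - p) at the pre-sigmoid output and is checked against finite differences.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(W1, W2, x):
    """ReLU hidden layer followed by a sigmoid output (probability of 'up')."""
    h = np.maximum(0.0, W1 @ x)
    p = sigmoid(W2 @ h)
    return p, h

def weighted_logp(W1, W2, x, y, advantage):
    """advantage * log p(y|x), the quantity whose gradient is ascended."""
    p, _ = forward(W1, W2, x)
    return advantage * (y * np.log(p) + (1 - y) * np.log(1.0 - p))

def backward(W1, W2, x, y, advantage):
    """Hand-written gradients of weighted_logp with respect to W1 and W2."""
    p, h = forward(W1, W2, x)
    dz = advantage * (y - p)   # gradient at the pre-sigmoid output
    dW2 = dz * h               # shape (H,)
    dh = dz * W2               # backpropagate into the hidden layer
    dh[h <= 0] = 0.0           # ReLU passes gradient only where it was active
    dW1 = np.outer(dh, x)      # shape (H, D)
    return dW1, dW2

# Hypothetical sizes: D = 3 inputs, H = 4 hidden units.
rng = np.random.default_rng(0)
W1 = rng.standard_normal((4, 3))
W2 = rng.standard_normal(4)
x = rng.standard_normal(3)
dW1, dW2 = backward(W1, W2, x, y=1, advantage=-0.7)

# Finite-difference check of every entry of dW2.
eps = 1e-6
for i in range(W2.size):
    W2_plus, W2_minus = W2.copy(), W2.copy()
    W2_plus[i] += eps
    W2_minus[i] -= eps
    numeric = (weighted_logp(W1, W2_plus, x, 1, -0.7)
               - weighted_logp(W1, W2_minus, x, 1, -0.7)) / (2 * eps)
    assert abs(numeric - dW2[i]) < 1e-6
```

A framework with automatic differentiation would derive these gradients from the forward pass alone; writing them out by hand is why the y - aprob term appears explicitly in the code.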

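At first I was confused too, and it took me some time to figure out what was really going on. Perhaps he could have made this clearer and given a few hints.

At least that is my humble understanding of his code :)

Votes: 2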

Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/54759093
