
Q: Resulting number of episodes_this_iter and timesteps_this_iter

Stack Overflow user

Asked 2019-08-15 03:05:45

1 answer, 110 views, 0 followers, 0 votes

I am running the stabilizing highway example with the following settings:

# time horizon of a single rollout
HORIZON = 750
# number of rollouts per training iteration
N_ROLLOUTS = 10
# number of parallel workers
N_CPUS = 1

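I would expect it to run N_ROLLOUTS episodes, each with HORIZON = 750 environment steps, and then train on the resulting "train_batch_size" = HORIZON * N_ROLLOUTS samples, which is 7500 in this case. With the settings above this is roughly what happens, and I get: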

done: false
  episode_len_mean: 750.0
  episode_reward_max: 378.8144682438323
  episode_reward_mean: 371.58900412233226
  episode_reward_min: 363.96868303824317
  episodes_this_iter: 10
  episodes_total: 10
  experiment_id: 321488011dc74a4d9b8d4e45dd6245af
  hostname: fortiss-n-065
  info:
    default:
      cur_kl_coeff: 0.20000000298023224
      cur_lr: 4.999999873689376e-05
      entropy: 18.401643753051758
      kl: 0.018729193136096
      policy_loss: -0.035810235887765884
      total_loss: 172.8546600341797
      vf_explained_var: -0.01728551648557186
      vf_loss: 172.88673400878906
    grad_time_ms: 6208.281
    load_time_ms: 47.492
    num_steps_sampled: 7600
    num_steps_trained: 7600
    sample_time_ms: 139252.878
    update_time_ms: 451.736
  iterations_since_restore: 1
  node_ip: 192.168.17.165
  num_metric_batches_dropped: 0
  pid: 15759
  policy_reward_mean: {}
  time_since_restore: 146.0045006275177
  time_this_iter_s: 146.0045006275177
  time_total_s: 146.0045006275177
  timestamp: 1565803764
  timesteps_since_restore: 7600
  timesteps_this_iter: 7600
  timesteps_total: 7600
  training_iteration: 1

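This is what I expected, except that 7600 timesteps were executed instead of 7500. (3 warmup steps x 10 would explain 30 of the extra steps.) But at least this is close to my expectation.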
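One way to check the 100 extra steps is to test a hypothesis other than warmup steps. The following is a minimal sketch, assuming that samples are gathered in fixed-size fragments and that collection stops once the total reaches train_batch_size. The fragment size of 200 is an assumption (it is not stated in the post), and `rounded_batch` is a hypothetical helper, not an RLlib function.

```python
import math

HORIZON = 750
N_ROLLOUTS = 10
FRAGMENT = 200  # ASSUMPTION: fragment size in steps; not given in the post


def rounded_batch(train_batch_size, fragment=FRAGMENT):
    """Return the smallest multiple of `fragment` that is >= train_batch_size."""
    return math.ceil(train_batch_size / fragment) * fragment


expected = HORIZON * N_ROLLOUTS
print(expected, rounded_batch(expected))  # 7500 7600
```

Under that assumption the rounded value matches the reported timesteps_this_iter of 7600, which suggests the offset may come from rounding up to whole fragments rather than from warmup steps.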

Now, if I change the settings to:

# time horizon of a single rollout
HORIZON = 750
# number of rollouts per training iteration
N_ROLLOUTS = 50
# number of parallel workers
N_CPUS = 25

this is the result:

  done: false
  episode_len_mean: 750.0
  episode_reward_max: 390.90993664259395
  episode_reward_mean: 377.01513002372076
  episode_reward_min: 359.4016123285148
  episodes_this_iter: 38
  episodes_total: 38
  experiment_id: 1721602809bf409daff0891552be1cc6
  hostname: fortiss-n-065
  info:
    default:
      cur_kl_coeff: 0.20000001788139343
      cur_lr: 4.999999873689376e-05
      entropy: 18.317197799682617
      kl: 0.013906443491578102
      policy_loss: -0.014860117807984352
      total_loss: 169.02618408203125
      vf_explained_var: 0.006413168739527464
      vf_loss: 169.0382843017578
    grad_time_ms: 22726.94
    load_time_ms: 66.28
    num_steps_sampled: 37600
    num_steps_trained: 37600
    sample_time_ms: 224101.383
    update_time_ms: 1391.422
  iterations_since_restore: 1
  node_ip: 192.168.17.165
  num_metric_batches_dropped: 0
  pid: 13919
  policy_reward_mean: {}
  time_since_restore: 248.4277114868164
  time_this_iter_s: 248.4277114868164
  time_total_s: 248.4277114868164
  timestamp: 1565802831
  timesteps_since_restore: 37600
  timesteps_this_iter: 37600
  timesteps_total: 37600
  training_iteration: 1

This I cannot explain. I would have expected:

episodes_this_iter: 50
timesteps_this_iter: 750 * 50 = 37500

Once again the timesteps are off by 100, which is at least close to the expected value, but how can episodes_this_iter be 38?
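One possible reading of the 38 is that episodes_this_iter counts only episodes that finished within the iteration, while each worker may stop in the middle of an episode. The sketch below works through that arithmetic. It assumes that samples arrive in fragments of 200 steps (a value not given in the post), that the fragments are spread as evenly as possible over the workers, and that every episode lasts exactly HORIZON steps. `completed_episodes` is a hypothetical helper, not an RLlib function.

```python
HORIZON = 750
N_CPUS = 25
FRAGMENT = 200         # ASSUMPTION: fragment size in steps; not given in the post
STEPS_SAMPLED = 37600  # num_steps_sampled from the reported output


def completed_episodes(steps_sampled, n_workers, horizon=HORIZON, fragment=FRAGMENT):
    """Count episodes fully finished when fragments are spread evenly over workers."""
    n_fragments = steps_sampled // fragment
    base, extra = divmod(n_fragments, n_workers)
    total = 0
    for worker in range(n_workers):
        # The first `extra` workers deliver one fragment more than the rest.
        frags = base + (1 if worker < extra else 0)
        # Only whole episodes of `horizon` steps count as completed.
        total += (frags * fragment) // horizon
    return total


print(completed_episodes(STEPS_SAMPLED, N_CPUS))  # 38
```

Here 188 fragments over 25 workers gives 13 workers with 1600 steps (2 finished episodes each) and 12 workers with 1400 steps (1 finished episode each), so 26 + 12 = 38. If this reading is right, RLlib's `batch_mode` setting ("truncate_episodes" versus "complete_episodes") would be the place to look, though the post does not confirm this.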

Then I tried setting up a multi-agent environment for the highway scenario. There the configuration looks like this:

   # time horizon of a single rollout
   HORIZON = 750
   # number of rollouts per training iteration
   N_ROLLOUTS = 5
   # number of parallel workers
   N_CPUS = 1

   config["num_workers"] = N_CPUS
   config["train_batch_size"] = HORIZON * N_ROLLOUTS

The result is:

done: false
  episode_len_mean: 748.0
  episode_reward_max: 1655.6750207800903
  episode_reward_mean: 1655.6750207800903
  episode_reward_min: 1655.6750207800903
  episodes_this_iter: 1
  episodes_total: 1
  experiment_id: 0757df140fe446f8af0bd7fbee0ba69b
  hostname: fortiss-n-065
  info:
    default:
      cur_kl_coeff: 0.20000000298023224
      cur_lr: 4.999999873689376e-05
      entropy: 1.397594690322876
      kl: 0.007053409703075886
      policy_loss: -0.0008417787030339241
      total_loss: 59.896278381347656
      vf_explained_var: 0.20434552431106567
      vf_loss: 59.89570617675781
    grad_time_ms: 2839.323
    load_time_ms: 45.883
    num_steps_sampled: 4222
    num_steps_trained: 4222
    sample_time_ms: 21399.539
    update_time_ms: 414.231
  iterations_since_restore: 1
  node_ip: 192.168.17.165
  num_metric_batches_dropped: 0
  pid: 23115
  policy_reward_mean: {}
  time_since_restore: 24.741214990615845
  time_this_iter_s: 24.741214990615845
  time_total_s: 24.741214990615845
  timestamp: 1565808435
  timesteps_since_restore: 4222
  timesteps_this_iter: 4222
  timesteps_total: 4222
  training_iteration: 1

What could be the problem here? What I have been expecting all along is:

episodes_this_iter = N_ROLLOUTS
timesteps_this_iter = train_batch_size = HORIZON * N_ROLLOUTS
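To see how far each run deviates from this expectation, the reported numbers can be put side by side. This is a small sketch that uses only values from the outputs above. `deviation` is a hypothetical helper and the run labels are only descriptive.

```python
HORIZON = 750

# (label, N_ROLLOUTS, episodes_this_iter, timesteps_this_iter) as reported above
runs = [
    ("single-agent, 1 worker", 10, 10, 7600),
    ("single-agent, 25 workers", 50, 38, 37600),
    ("multi-agent, 1 worker", 5, 1, 4222),
]


def deviation(n_rollouts, episodes, timesteps, horizon=HORIZON):
    """Return (episode difference, timestep difference) relative to the expectation."""
    return episodes - n_rollouts, timesteps - horizon * n_rollouts


for label, n_rollouts, episodes, timesteps in runs:
    d_ep, d_ts = deviation(n_rollouts, episodes, timesteps)
    print(f"{label}: episodes {d_ep:+d}, timesteps {d_ts:+d}")
```

The two single-agent runs both overshoot by exactly 100 timesteps, while the multi-agent run overshoots by 472 and reports 4 fewer episodes than expected, so it may need a separate explanation.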

flow-project

1 Answer

Stack Overflow user

Answered 2019-08-19 07:34:54

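Hmm, this looks like a quirk of RLlib rather than something related to Flow? Sorry if this doesn't help, but the RLlib developers may be able to answer this better.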

Votes: 0

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/57500632
