Understanding the Continuous-Time TD Error (Continuous TD)


\[
\delta_{TD} = \frac{d}{dt} V^\pi(s(t)) - \frac{1}{\tau_r} V^\pi(s(t)) + r(s(t),a(t))
\]




\[
\delta_{TD} = r_t + \gamma V(s_{t+1}) - V(s_t)
\]



\[
V^\pi(s_t) := E[r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \gamma^3 r_{t+3} + \cdots]_\pi
\]


\begin{align}
V^\pi(s_t) &= E[r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \gamma^3 r_{t+3} + \cdots]_\pi \\
&= r_t + \gamma (E[r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots]_\pi )\\
&= r_t + \gamma V^\pi(s_{t+1})
\end{align}


\[
0 = r_t + \gamma V^\pi(s_{t+1}) - V^\pi(s_t)
\]


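Early in learning, however, the value of a state cannot yet be predicted correctly. A difference \(\delta_{TD}\) therefore arises between the value computed after the actual transition, \(r_t + \gamma V^\pi(s_{t+1})\), and the value predicted before the transition, \(V^\pi(s_t)\).

This error is called the Temporal Difference error (TD error).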


\[
V^\pi_{i+1}(s_t) \leftarrow V^\pi_{i}(s_t) + \alpha \delta_{TD}
\]
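As a minimal sketch of this update rule, the following applies tabular TD(0) to a small deterministic chain. The environment (a chain of five states with a reward of 1 on entering the terminal state), the function name `td0_chain`, and the values of `gamma` and `alpha` are assumptions made for illustration only; they do not come from the text above.

```python
def td0_chain(n_states=5, gamma=0.9, alpha=0.1, episodes=2000):
    """Tabular TD(0) on a hypothetical chain 0 -> 1 -> ... -> n_states-1.

    The last state is terminal. The reward is 1.0 on the transition
    into the terminal state and 0.0 on every other transition.
    """
    V = [0.0] * n_states  # value estimates; the terminal value stays 0
    for _ in range(episodes):
        for s in range(n_states - 1):
            s_next = s + 1
            r = 1.0 if s_next == n_states - 1 else 0.0
            # TD error: delta = r_t + gamma * V(s_{t+1}) - V(s_t)
            delta = r + gamma * V[s_next] - V[s]
            # Update: V(s_t) <- V(s_t) + alpha * delta
            V[s] += alpha * delta
    return V


V = td0_chain()
print([round(v, 4) for v in V])
```

In this deterministic chain the estimates approach \(\gamma^{k}\), where \(k\) is the number of remaining transitions before the rewarded one, and the TD error shrinks toward zero as they do.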




\[
V^\pi(s(t)) := E\left [ \int_t^\infty e^{-\frac{t' - t}{\tau_r} } r(s(t'),a(t')) dt' \right ]_\pi
\]

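In the discrete case the value was a sum; in the continuous case it becomes an integral. In place of \(\gamma\), the reward \(r(s(t'),a(t'))\) is multiplied by an exponential discount \(e^{-\frac{t' - t}{\tau_r}}\). The smaller the time constant \(\tau_r\), the more heavily future rewards are discounted.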
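To make the role of \(\tau_r\) concrete, here is a small sketch that approximates the discounted integral numerically for a deterministic reward signal, so the expectation is trivial. The function names, the midpoint-rule integration, the truncation horizon, and the constant reward are all assumptions chosen for illustration.

```python
import math


def discounted_value(reward_fn, tau_r, t=0.0, horizon_in_taus=20.0, dt=1e-3):
    """Midpoint-rule approximation of
    integral_{t}^{inf} exp(-(t' - t) / tau_r) * r(t') dt',
    truncated after `horizon_in_taus` time constants.
    """
    n_steps = round(horizon_in_taus * tau_r / dt)
    total = 0.0
    for k in range(n_steps):
        delay = (k + 0.5) * dt  # t' - t at the midpoint of the k-th slice
        total += math.exp(-delay / tau_r) * reward_fn(t + delay) * dt
    return total


def constant_reward(_t):
    """Hypothetical reward signal: r = 1 at all times."""
    return 1.0


for tau_r in (0.5, 1.0, 2.0):
    print(tau_r, discounted_value(constant_reward, tau_r))
```

For a constant reward of 1 the integral equals \(\tau_r\) exactly, so a smaller time constant gives a smaller value: rewards far in the future contribute less.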



\begin{align}
&\frac{d}{dt} V^\pi(s(t)) \\
&= \frac{d}{dt} E\left [ \int_t^\infty e^{-\frac{t' - t}{\tau_r} }r(s(t'),a(t')) dt' \right ]_\pi\\
&= \frac{d}{dt} E\left [ e^{\frac{t}{\tau_r}} \int_t^\infty e^{-\frac{t'}{\tau_r} } r(s(t'),a(t')) dt' \right ]_\pi\\
&= E\left [ \left ( \frac{d}{dt}e^{\frac{t}{\tau_r}}\right ) \int_t^\infty e^{-\frac{t'}{\tau_r} } r(s(t'),a(t')) dt' + e^{\frac{t}{\tau_r}} \left ( \frac{d}{dt}\int_t^\infty e^{-\frac{t'}{\tau_r} } r(s(t'),a(t')) dt'\right ) \right ]_\pi\\
&= E\left [ \frac{1}{\tau_r}e^{\frac{t}{\tau_r}} \int_t^\infty e^{-\frac{t'}{\tau_r} } r(s(t'),a(t')) dt' + e^{\frac{t}{\tau_r}} \left ( 0 - e^{-\frac{t}{\tau_r} } r(s(t),a(t))\right ) \right ]_\pi\\
&= E\left [ \frac{1}{\tau_r} \int_t^\infty e^{-\frac{t' - t}{\tau_r} } r(s(t'),a(t')) dt' - r(s(t),a(t)) \right ]_\pi\\
&= \frac{1}{\tau_r} V^\pi(s(t)) - r(s(t),a(t))
\end{align}


\[
0 = \frac{d}{dt} V^\pi(s(t)) - \frac{1}{\tau_r} V^\pi(s(t)) + r(s(t),a(t))
\]
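This condition can be checked numerically on a case where the value is available in closed form. The sketch below assumes a deterministic reward signal \(r(t) = \sin t\), chosen purely for illustration. Evaluating the discounted integral for that signal gives \(V(t) = \frac{a \sin t + \cos t}{a^2 + 1}\) with \(a = 1/\tau_r\). The names `reward`, `value`, `continuous_td_residual` and the value of `TAU_R` are hypothetical.

```python
import math

TAU_R = 0.5  # assumed time constant for this illustration


def reward(t):
    """Hypothetical deterministic reward signal r(t) = sin(t)."""
    return math.sin(t)


def value(t, tau_r=TAU_R):
    """Closed form of integral_{t}^{inf} exp(-(t'-t)/tau_r) * sin(t') dt'."""
    a = 1.0 / tau_r
    return (a * math.sin(t) + math.cos(t)) / (a * a + 1.0)


def continuous_td_residual(t, tau_r=TAU_R, h=1e-5):
    """dV/dt - V/tau_r + r, with dV/dt taken by a central difference."""
    dV_dt = (value(t + h, tau_r) - value(t - h, tau_r)) / (2.0 * h)
    return dV_dt - value(t, tau_r) / tau_r + reward(t)


for t in (0.0, 0.7, 2.0, 5.0):
    print(t, continuous_td_residual(t))
```

With the exact value function the residual is zero up to finite-difference error. With an imperfect estimate of \(V^\pi\) it would not vanish.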


\[
\delta_{TD} = \frac{d}{dt} V^\pi(s(t)) - \frac{1}{\tau_r} V^\pi(s(t)) + r(s(t),a(t))
\]


\begin{align}
\delta_{TD} &= \frac{d}{dt} V^\pi(s(t)) - \frac{1}{\tau_r} V^\pi(s(t)) + r(s(t),a(t)) \\
&\simeq \frac{V^\pi(s(t)) - V^\pi(s(t-\Delta t))}{\Delta t} - \frac{1}{\tau_r} V^\pi(s(t)) + r(s(t),a(t))\\
&= \frac{1}{\Delta t}\left( V^\pi(s(t)) - V^\pi(s(t-\Delta t)) - \frac{\Delta t}{\tau_r} V^\pi(s(t)) +\Delta t \times r(s(t),a(t)) \right)\\
&= \frac{1}{\Delta t}\left( \left(1-\frac{\Delta t}{\tau_r}\right)V^\pi(s(t)) - V^\pi(s(t-\Delta t)) +\Delta t \times r(s(t),a(t)) \right)\\
&= \frac{1}{\Delta t}\left(\Delta t \times r(s(t),a(t)) + \gamma V^\pi(s(t)) - V^\pi(s(t-\Delta t)) \right)
\end{align}


Here we have set \(\gamma=1-\frac{\Delta t}{\tau_r}\).
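The backward-difference form and the final discrete-style form can be compared directly in code. In this sketch the function names and the sample numbers are assumptions for illustration. It also compares \(\gamma = 1 - \frac{\Delta t}{\tau_r}\) with \(e^{-\Delta t/\tau_r}\), the factor by which the exponential discount shrinks over one step; the two agree to first order in \(\Delta t/\tau_r\).

```python
import math


def td_error_backward_difference(v_now, v_prev, r_now, dt, tau_r):
    """(V(t) - V(t - dt)) / dt - V(t) / tau_r + r(t)."""
    return (v_now - v_prev) / dt - v_now / tau_r + r_now


def td_error_discrete_form(v_now, v_prev, r_now, dt, tau_r):
    """(dt * r(t) + gamma * V(t) - V(t - dt)) / dt, gamma = 1 - dt / tau_r."""
    gamma = 1.0 - dt / tau_r
    return (dt * r_now + gamma * v_now - v_prev) / dt


# Hypothetical sample values, not taken from the text.
v_now, v_prev, r_now = 0.8, 0.75, 0.2
dt, tau_r = 0.01, 0.5

d1 = td_error_backward_difference(v_now, v_prev, r_now, dt, tau_r)
d2 = td_error_discrete_form(v_now, v_prev, r_now, dt, tau_r)
print(d1, d2)  # the two forms agree up to floating-point rounding

gamma_linear = 1.0 - dt / tau_r
gamma_exact = math.exp(-dt / tau_r)
print(gamma_linear, gamma_exact)
```

Apart from the overall factor \(\frac{1}{\Delta t}\) and the reward being scaled by \(\Delta t\), the last form has the same shape as the discrete TD error \(r_t + \gamma V(s_{t+1}) - V(s_t)\).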


Doya, K. (2000). Reinforcement Learning in Continuous Time and Space. Neural Computation, 12(1), 219–245. https://doi.org/10.1162/089976600300015961


Posted by Nakamura