CBN (Computational Biology and Neurocomputing) seminars

Action selection performance of a reconfigurable Basal Ganglia model with a Hebbian-Bayesian Go-NoGo connectivity

by Pierre Berthet (Department of Computational Biology, CSC, KTH / SU)

RB35

Description
Several studies have shown a strong involvement of the basal ganglia in action selection and reward learning, and segregated parallel cortico-basal ganglia-thalamo-cortical loops have been described (Gerfen et al., 1987; Parent, 1990; Haber, 2003; Fujiyama et al., 2011). The dopaminergic signal to the striatum is commonly described as coding a reward prediction error (RPE), the difference between the predicted and the actual reward, which has been shown to be critical in modulating synaptic plasticity at cortico-striatal synapses in both the direct and the indirect pathway (Shen et al., 2008). In Actor-Critic models, this RPE signal is used both to teach the Actor the appropriate sensorimotor associations, so as to learn the state-action mapping that optimizes reward, and to update the weights of the Critic, so as to improve the reward prediction.

We developed an abstract computational model of the basal ganglia and compared its structure and behavior with biological data. The computations in our model are inspired by Bayesian inference, and synaptic plasticity is governed by a three-factor Hebbian-Bayesian rule based on the co-activation of pre- and post-synaptic units and on the value of the RPE (Sandberg et al., 2002). The model implements a direct (Go) and an indirect (NoGo) pathway, as described in the basal ganglia, within a modified Actor-Critic architecture (Suri and Schultz, 2001; Sutton and Barto, 1998). The Go and NoGo projections act in a complementary fashion: the former learns when the RPE is positive, the latter when it is negative.

We compared the performance of different ways this system can be configured for action selection, e.g. (i) using the classical actor activation, or (ii) using the critic to maximize the predicted reward over actions proposed by the actor. We evaluated learning performance in several types of learning paradigms, such as learning-relearning, reversal learning and n-armed bandit tasks, which are often used in theoretical studies as well as in animal learning experiments. Our results show that there is no single best way to configure this basal ganglia model for all the learning paradigms tested. For example, when the agent had to learn one appropriate action for each state, the best results came from using only the reward prediction to select the action, and this held across different probabilistic reward schedules (from 100% down to 10%, in steps of 10%). In standard reversal learning, however, using the actor as such gave the best performance. We thus suggest that an agent might dynamically configure its action selection strategy, possibly depending on task characteristics and on how much time is available. We further ran simulations in which either the direct or the indirect pathway was “lesioned” and found good matches with biological data. The activity of the model during learning was similar to electrophysiological data (Samejima et al., 2005).
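To give a concrete flavor of the two action-selection configurations and the RPE-gated Go/NoGo learning described in the abstract, here is a minimal, self-contained Python sketch run on a toy probabilistic bandit. It deliberately simplifies the presented model: the Hebbian-Bayesian (BCPNN-type) plasticity rule of Sandberg et al. (2002) is replaced by a plain tabular RPE-gated update, and the variable names, parameter values and two-candidate "proposal" mechanism are illustrative assumptions of ours, not details of the actual model.

    import numpy as np

    rng = np.random.default_rng(seed=1)

    n_states, n_actions = 4, 4
    w_go   = np.zeros((n_states, n_actions))   # direct ("Go") pathway weights
    w_nogo = np.zeros((n_states, n_actions))   # indirect ("NoGo") pathway weights
    r_hat  = np.zeros((n_states, n_actions))   # critic: predicted reward per state-action pair
    lr = 0.1                                   # illustrative learning rate

    def select_action(state, mode="actor", n_proposals=2):
        """Two ways to configure action selection (hypothetical names)."""
        drive = w_go[state] - w_nogo[state]              # net Go minus NoGo activation
        if mode == "actor":                              # (i) classical actor activation
            return int(np.argmax(drive))
        proposals = np.argsort(drive)[-n_proposals:]     # (ii) actor proposes candidates,
        best = np.argmax(r_hat[state, proposals])        #      critic picks the one with the
        return int(proposals[best])                      #      highest predicted reward

    def update(state, action, reward):
        """Three-factor-style update: the chosen state-action pair is gated by the RPE."""
        rpe = reward - r_hat[state, action]              # reward prediction error
        if rpe > 0:
            w_go[state, action] += lr * rpe              # Go pathway learns from positive RPE
        else:
            w_nogo[state, action] += lr * (-rpe)         # NoGo pathway learns from negative RPE
        r_hat[state, action] += lr * rpe                 # critic improves its reward prediction

    # Toy paradigm: per state, one "correct" action is rewarded with probability p_reward.
    p_reward = 0.8
    correct = rng.integers(n_actions, size=n_states)
    for trial in range(2000):
        s = int(rng.integers(n_states))
        a = select_action(s, mode="critic")
        r = float(rng.random() < p_reward) if a == correct[s] else 0.0
        update(s, a, r)

In this toy setting, reversal learning can be mimicked by re-drawing the "correct" actions halfway through the trials, and comparing mode="actor" against mode="critic" then reproduces, in spirit, the kind of configuration comparison discussed in the talk.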
Literature

Fujiyama, F., Sohn, J., Nakano, T., Furuta, T., Nakamura, K. C., Matsuda, W., and Kaneko, T. (2011). Exclusive and common targets of neostriatofugal projections of rat striosome neurons: a single neuron-tracing study using a viral vector. The European Journal of Neuroscience 33, 668-677.
Gerfen, C. R., Herkenham, M., and Thibault, J. (1987). The neostriatal mosaic: II. Patch- and matrix-directed mesostriatal dopaminergic and non-dopaminergic systems. Journal of Neuroscience 7, 3915-3934.
Haber, S. (2003). The primate basal ganglia: parallel and integrative networks. Journal of Chemical Neuroanatomy 26, 317-330.
Parent, A. (1990). Extrinsic connections of the basal ganglia. Trends in Neurosciences 13, 254-258.
Samejima, K., Ueda, Y., Doya, K., and Kimura, M. (2005). Representation of action-specific reward values in the striatum. Science 310, 1337-1340.
Sandberg, A., Lansner, A., Petersson, K. M., and Ekeberg, Ö. (2002). A Bayesian attractor network with incremental learning. Network: Computation in Neural Systems 13, 179-194.
Shen, W., Flajolet, M., Greengard, P., and Surmeier, D. J. (2008). Dichotomous dopaminergic control of striatal synaptic plasticity. Science 321, 848-851.
Suri, R. E., and Schultz, W. (2001). Temporal difference model reproduces anticipatory neural activity. Neural Computation 13, 841-862.
Sutton, R. S., and Barto, A. G. (1998). Reinforcement Learning: An Introduction. MIT Press.