Learning using gradient descent on a free-energy potential

Based on the work of K. Friston and following R. Bogacz, a learning scheme is implemented using gradient descent on a “free-energy” potential.


Biological intuition

Consider a theoretical one-dimensional thermoregulator. This simple organism stays alive by maximising the time it spends in an optimal temperature state, which we assume to be defined on an evolutionary timescale. A homeostatic mechanism could be simple feedback control, like the thermostat on a heater. Unlike the thermostat, which has direct access to (and control over) temperature, the thermoregulator relies on efferent signalling to infer and control its hidden state - temperature. The only signal the regulator has access to is the real (Euclidean) distance between its current temperature and the optimal temperature. This absolute distance $\phi$ is the homeostatic error and is communicated via a noisy efferent signal $\epsilon$. The non-linear function $g$ relates homeostatic error to perceived efferent signal, such that when the homeostatic error is exactly $\phi$, the perceived efferent signal is normally distributed with mean $g(\phi)$ and variance $\Sigma_\epsilon$.
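To make this concrete, here is a minimal sketch of the generative model, assuming (as in the code further below) that $g(\phi) = \phi^2$ and that the sensory noise has unit variance; the hidden error value and the number of samples are arbitrary choices for illustration:

import numpy as np

rng = np.random.default_rng(0)

def g(phi):
    # assumed non-linear mapping from homeostatic error to efferent signal
    return phi ** 2

phi_true = 1.5   # hypothetical hidden homeostatic error
sigma_e = 1.0    # assumed variance of the sensory noise
signals = rng.normal(g(phi_true), np.sqrt(sigma_e), size=5)  # noisy efferent signals the regulator perceives
print(signals)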


Part I - Bayes

The likelihood function (the probability of a signal given a homeostatic error) is defined as

$$p(\epsilon \mid \phi) = f(\epsilon;\, g(\phi),\, \Sigma_\epsilon)$$

where $f(x; \mu, \Sigma)$ denotes the density of a normal distribution with mean $\mu$ and variance $\Sigma$,

$$f(x; \mu, \Sigma) = \frac{1}{\sqrt{2\pi\Sigma}} \exp\!\left(-\frac{(x-\mu)^2}{2\Sigma}\right).$$
Through evolutionary filtering, the agent has been endowed with strong priors on its interoceptive states and therefore expects the homeostatic error to be normally distributed with mean $\epsilon_p$ and variance $\Sigma_p$, where the subscript $p$ stands for prior. Formally, $p(\phi) = f(\phi;\, \epsilon_p,\, \Sigma_p)$.

To compute the exact posterior distribution of the homeostatic error given the sensory input, we can use Bayes' theorem

$$p(\phi \mid \epsilon) = \frac{p(\phi)\, p(\epsilon \mid \phi)}{p(\epsilon)}$$

where the denominator is

$$p(\epsilon) = \int p(\phi)\, p(\epsilon \mid \phi)\, d\phi$$

and is computed by summing over the whole range of possible $\phi$ (a discrete approximation of the integral).

The following code implements such an exact solution and plots it. Firstly we import some dependencies:

import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import norm

sns.set(style="white", palette="muted", color_codes=True)

%matplotlib inline

and then define $g(\phi)$:

# non-linear transformation of homeostatic error to perceived sensory input, i.e. g(phi)
def sensory_transform(phi):

    sensory_output = np.square(phi)

    return sensory_output

The reason we define $g$ explicitly is that we might want to change it later. For now we assume the simple non-linear relation $g(\phi) = \phi^2$. The following snippet of code assumes values for $\epsilon$, $\Sigma_\epsilon$, $\epsilon_p$ and $\Sigma_p$ and plots the posterior distribution $p(\phi \mid \epsilon)$.

def exact_bayes():
    
    # variables 
    epsilon = 2       # observed (noisy) efferent signal
    sigma_e = 1       # variance of the sensory noise (Sigma_epsilon)
    epsilon_p = 3     # mean of the prior on homeostatic error (epsilon_p)
    sigma_s = 1       # variance of the prior on homeostatic error (Sigma_p)
    s_range = np.arange(0.01, 5, 0.01)  # range of candidate homeostatic errors (phi)
    s_step = 0.01     # grid step size
    
    # exact Bayes (equation 4)
    numerator = (np.multiply(norm.pdf(s_range, epsilon_p, np.sqrt(sigma_s)),                   # prior
                             norm.pdf(epsilon, sensory_transform(s_range), np.sqrt(sigma_e)))) # likelihood
    normalisation = np.sum(numerator * s_step)  # denominator / model evidence p(epsilon) (equation 5)
    posterior = numerator / normalisation       # posterior p(phi | epsilon)
    
    # plot exact Bayes
    plt.figure(figsize=(7.5, 2.5))
    plt.plot(s_range, posterior)
    plt.xlabel(r'$\phi$')
    plt.ylabel(r'$p(\phi | \epsilon)$')
    sns.despine()

exact_bayes()

[Figure: the exact posterior $p(\phi \mid \epsilon)$ over the range of candidate $\phi$ values.]

Inspecting the graph, we find that $\phi \approx 1.6$ approximately maximises the posterior. There are two fundamental problems with this approach:

  1. The posterior does not take a standard form, and is thus described by (potentially) infinitely many moments rather than by simple sufficient statistics, such as the mean and variance of a Gaussian.

  2. The normalisation term $p(\epsilon)$ that sits in the denominator of Bayes' formula

$$p(\epsilon) = \int p(\phi)\, p(\epsilon \mid \phi)\, d\phi$$

can be complicated to evaluate, and numerical solutions often rely on computationally intensive algorithms, such as the Expectation-Maximisation algorithm.
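To illustrate the first point, a small sketch (re-using the same values as the code above) can compute the first two moments of the numerically obtained posterior and compare the posterior against a Gaussian with exactly those moments; any mismatch indicates that mean and variance alone do not fully describe it:

import numpy as np
from scipy.stats import norm

s_range = np.arange(0.01, 5, 0.01)
s_step = 0.01

# same grid-based posterior as in exact_bayes()
numerator = norm.pdf(s_range, 3, 1) * norm.pdf(2, np.square(s_range), 1)
posterior = numerator / np.sum(numerator * s_step)

# first two moments of the numerical posterior
post_mean = np.sum(s_range * posterior * s_step)
post_var = np.sum((s_range - post_mean) ** 2 * posterior * s_step)

# maximum absolute difference to a Gaussian with the same mean and variance
gauss = norm.pdf(s_range, post_mean, np.sqrt(post_var))
print(post_mean, post_var, np.max(np.abs(posterior - gauss)))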


Part II - Approximate inference

We are interested in a more general way of finding the value $\phi$ that maximises the posterior $p(\phi \mid \epsilon)$. Since the denominator $p(\epsilon)$ does not depend on $\phi$, maximising the numerator $p(\phi)\,p(\epsilon \mid \phi)$ will also maximise the posterior. Taking the logarithm of the numerator we get

$$F = \ln p(\phi) + \ln p(\epsilon \mid \phi) = C - \frac{(\phi - \epsilon_p)^2}{2\Sigma_p} - \frac{(\epsilon - g(\phi))^2}{2\Sigma_\epsilon}$$

where $C$ collects the terms that do not depend on $\phi$. The dynamics that perform gradient ascent on $F$ can be derived (see notes) to be

$$\frac{d\phi}{dt} = \frac{\partial F}{\partial \phi} = \frac{\epsilon_p - \phi}{\Sigma_p} + \frac{\epsilon - g(\phi)}{\Sigma_\epsilon}\, g'(\phi).$$

The next snippet of code assumes values for $\epsilon$, $\Sigma_\epsilon$, $\epsilon_p$ and $\Sigma_p$ and implements the above dynamics to find the value of $\phi$ that maximises the posterior, iterating with Euler's method.

def simple_dyn():
    
    # variables 
    epsilon = 2       # observed (noisy) efferent signal
    sigma_e = 1       # variance of the sensory noise (Sigma_epsilon)
    epsilon_p = 3     # mean of the prior on homeostatic error (epsilon_p)
    sigma_s = 1       # variance of the prior on homeostatic error (Sigma_p)
    s_range = np.arange(0.01, 5, 0.01)  # time axis for the simulation
    s_step = 0.01     # Euler integration step size
    
    # phi holds the running estimate of the most likely homeostatic error
    phi = np.zeros(np.size(s_range))
    phi[0] = epsilon_p  # initialise at the prior mean
    
    # use Euler's method to find the value of phi that maximises the posterior
    for i in range(1, len(s_range)):
        phi[i] = phi[i - 1] + s_step * ( ( (epsilon_p - phi[i - 1]) / sigma_s ) +
                 ( ( epsilon - sensory_transform(phi[i - 1]) ) / sigma_e ) * (2 * phi[i - 1]) ) # equation 12
    
    # plot convergence
    plt.figure(figsize=(5, 2.5))
    plt.plot(s_range, phi)
    plt.xlabel('Time')
    plt.ylabel(r'$\phi$')
    sns.despine()
    
simple_dyn()

[Figure: $\phi$ converging over time under the gradient dynamics.]

It is clear that the output converges rapidly to $\phi \approx 1.6$, the value that maximises the posterior.
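As a quick cross-check (a sketch that is not part of the learning scheme itself), we can maximise $F$ directly with an off-the-shelf optimiser and confirm that the fixed point of the dynamics agrees with it:

import numpy as np
from scipy.optimize import minimize_scalar

epsilon, sigma_e = 2, 1      # observed signal and sensory noise variance
epsilon_p, sigma_s = 3, 1    # prior mean and prior variance

def neg_F(phi):
    # negative log of the unnormalised posterior, up to a constant
    return (phi - epsilon_p) ** 2 / (2 * sigma_s) + (epsilon - phi ** 2) ** 2 / (2 * sigma_e)

result = minimize_scalar(neg_F, bounds=(0.01, 5), method='bounded')
print(result.x)  # approximately 1.6

Both approaches land on the same value, but only the gradient dynamics translate naturally into a biologically plausible mechanism.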

So we ask the question: What does a minimal and biologically plausible network model that can do such calculations look like?


Part III - Learning with a network model

Firstly, we must specify what exactly biologically plausible means: 1) a neuron only performs computations on the input it is given, weighted by its synaptic weights; 2) synaptic plasticity at a connection is based only on the activity of the pre-synaptic and post-synaptic neurons it connects.

Consider the dynamics of a simple network that relies on just a few neurons (a node encoding $\phi$ and two prediction-error nodes) and is coherent with the above requirements of local computation:

$$\frac{d\phi}{dt} = -\xi_e + \xi_s\, g'(\phi), \qquad \frac{d\xi_e}{dt} = \phi - \epsilon_p - \Sigma_p\, \xi_e, \qquad \frac{d\xi_s}{dt} = \epsilon - g(\phi) - \Sigma_\epsilon\, \xi_s$$

where $\xi_e$ and $\xi_s$ are the prediction errors

$$\xi_e = \frac{\phi - \epsilon_p}{\Sigma_p}, \qquad \xi_s = \frac{\epsilon - g(\phi)}{\Sigma_\epsilon}$$

(the values to which their dynamics relax) that arise from the assumption that the input is normally distributed (again, see the notes for derivations). The next snippet of code implements these dynamics; the network thus “learns” the value of $\phi$ that maximises the posterior.

def learn_phi():
    
    # variables 
    epsilon = 2       # observed (noisy) efferent signal
    sigma_e = 1       # variance of the sensory noise (Sigma_epsilon)
    epsilon_p = 3     # mean of the prior on homeostatic error (epsilon_p)
    sigma_s = 1       # variance of the prior on homeostatic error (Sigma_p)
    s_range = np.arange(0.01, 5, 0.01)  # time axis for the simulation
    s_step = 0.01     # Euler integration step size
    
    # preallocate
    phi = np.zeros(np.size(s_range))   # estimate of the homeostatic error
    xi_e = np.zeros(np.size(s_range))  # prediction error on the prior
    xi_s = np.zeros(np.size(s_range))  # prediction error on the sensory input
    
    # initial conditions
    phi[0] = epsilon_p  # best guess (prior mean) of the homeostatic error
    xi_e[0] = 0         # prediction error on the prior
    xi_s[0] = 0         # prediction error on the sensory input
    
    # dynamics of phi and the two prediction-error nodes
    for i in range(1, len(s_range)):
        phi[i] = phi[i-1] + s_step * ( -xi_e[i-1] + xi_s[i-1] * ( 2 * phi[i-1] ) )                     # equation 12
        xi_e[i] = xi_e[i-1] + s_step * ( phi[i-1] - epsilon_p - sigma_s * xi_e[i-1] )                  # equation 13
        xi_s[i] = xi_s[i-1] + s_step * ( epsilon - sensory_transform(phi[i-1]) - sigma_e * xi_s[i-1] ) # equation 14
    
    # plot network dynamics
    plt.figure(figsize=(5, 2.5))
    plt.plot(s_range, phi, label=r'$\phi$')
    plt.plot(s_range, xi_e, label=r'$\xi_e$')
    plt.plot(s_range, xi_s, label=r'$\xi_s$')
    plt.xlabel('Time')
    plt.ylabel('Activity')
    plt.legend(frameon=False)
    sns.despine()

learn_phi()

[Figure: network activity of $\phi$, $\xi_e$ and $\xi_s$ over time.]

As the figure shows, the network converges to the same value, but more slowly than the direct gradient dynamics above: the model relies on several nodes that inhibit and excite each other, which causes oscillatory behaviour. Both $\xi_e$ and $\xi_s$ oscillate and converge to the values at which their dynamics are at rest,

$$\xi_e = \frac{\phi - \epsilon_p}{\Sigma_p}, \qquad \xi_s = \frac{\epsilon - g(\phi)}{\Sigma_\epsilon}.$$
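As a sanity check (again a sketch, not part of the network itself), the joint fixed point of the three node equations can be found with a standard root finder; it should match the equilibrium activities in the plot and the $\phi \approx 1.6$ found earlier:

import numpy as np
from scipy.optimize import fsolve

epsilon, sigma_e = 2, 1      # observed signal and sensory noise variance
epsilon_p, sigma_s = 3, 1    # prior mean and prior variance

def network_rhs(x):
    phi, xi_e, xi_s = x
    return [-xi_e + xi_s * 2 * phi,              # d(phi)/dt
            phi - epsilon_p - sigma_s * xi_e,    # d(xi_e)/dt
            epsilon - phi ** 2 - sigma_e * xi_s] # d(xi_s)/dt

print(fsolve(network_rhs, [epsilon_p, 0, 0]))  # phi, xi_e, xi_s at equilibrium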


Part IV - Learning the variance with a network model

Recall that we assumed the homeostatic error to be communicated via a noisy efferent signal $\epsilon$ that we assumed to be normally distributed. Above, we outlined a simple method for finding the value $\phi$ that maximises the posterior $p(\phi \mid \epsilon)$.

By expanding this simple model, we can estimate the variance of the normal distribution as well. Consider the computation in a single node computing the prediction error

$$\xi = \frac{\phi - \bar\phi}{\Sigma}$$

where $\Sigma$ is the variance of the homeostatic error $\phi$ and $\bar\phi$ is its mean (the prediction). Estimation of $\Sigma$ can be achieved by adding an interneuron $e$ that is connected to the prediction-error node and receives input from it via a connection with weight encoding $\Sigma$. The dynamics are described by

$$\frac{d\xi}{dt} = \phi - \bar\phi - e, \qquad \frac{de}{dt} = \Sigma\,\xi - e$$

with the weight updated at the end of each episode as $\Delta\Sigma = \alpha(\xi e - 1)$, which the following snippet of code implements.

def learn_sigma():
    
    # simulation parameters
    s_step = 0.01       # Euler integration step size
    trials = 2000       # number of trials
    epi_length = 20     # length (in time units) of each episode
    alpha = 0.01        # learning rate for the synaptic weight Sigma
    
    mean_phi = 5        # mean of the homeostatic error phi
    sigma_phi = 2       # variance of phi
    last_phi = 5        # prediction of phi (here fixed at the mean)
    
    n_steps = int(epi_length / s_step)  # Euler steps per episode
    
    # preallocate
    sigma = np.zeros(trials)
    error = np.zeros(n_steps)
    e = np.zeros(n_steps)
    
    sigma[0] = 1  # initialise sigma at 1
    
    for j in range(1, trials):
        
        error[0] = 0  # reset prediction-error node
        e[0] = 0      # reset interneuron
        phi = np.random.normal(mean_phi, np.sqrt(sigma_phi))  # draw a new phi every trial
        
        # within-trial dynamics of the prediction-error node and the interneuron
        for i in range(1, n_steps):
            error[i] = error[i-1] + s_step * (phi - last_phi - e[i-1])   # equation 59 in Bogacz
            e[i] = e[i-1] + s_step * (sigma[j-1] * error[i-1] - e[i-1])  # equation 60 in Bogacz
        
        sigma[j] = sigma[j-1] + alpha * (error[-1] * e[-1] - 1)  # synaptic weight (Sigma) update
    
    # plot the trajectory of Sigma across trials
    plt.figure(figsize=(5, 2.5))
    plt.plot(sigma)
    plt.xlabel('Trial')
    plt.ylabel(r'$\Sigma$')
    sns.despine()

learn_sigma()

[Figure: the synaptic weight $\Sigma$ across trials.]

Because a new $\phi$ is drawn on every trial (phi = np.random.normal(mean_phi, np.sqrt(sigma_phi))), $\Sigma$ never settles on a single value; instead it fluctuates around approximately 2, the variance of $\phi$.
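As a final sketch (assuming the same generative values, mean 5 and variance 2), the learned weight can be compared with the sample variance of the kind of draws the network sees:

import numpy as np

rng = np.random.default_rng(1)
draws = rng.normal(5, np.sqrt(2), size=2000)  # phi values across trials
print(np.var(draws))  # close to 2, the value Sigma fluctuates around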