Discovering Differential Equations with Physics-Informed Neural Networks and Symbolic Regression

Differential equations serve as a powerful framework to capture and understand the dynamic behaviors of physical systems. By describing how variables change in relation to each other, they provide insights into system dynamics and allow us to make predictions about the system’s future behavior.

However, a common challenge we face in many real-world systems is that their governing differential equations are often only partially known, with __ the unknown aspects manifesting in several ways:

The parameters of the differential equation are unknown. A case in point is wind engineering, where the governing equations of fluid dynamics are well-established, but the coefficients relating to turbulent flow are highly uncertain.
The functional forms of the differential equations are unknown. For instance, in chemical engineering, the exact functional form of the rate equations may not be fully understood due to the uncertainties in rate-determining steps and reaction pathways.
Both functional forms and parameters are unknown. A prime example is battery state modeling, where the commonly used equivalent circuit model only partially captures the current-voltage relationship (the functional form of the missing physics is therefore unknown). Moreover, the model itself contains unknown parameters (i.e., resistance and capacitance values).

Figure 1. The governing equations of many real-world dynamical systems are only partially known. (Image by this blog author)

Such partial knowledge of the governing differential equations hinders our understanding and control of these dynamical systems. Consequently, inferring these unknown components based on observed data becomes a crucial task in dynamical system modeling.

Broadly speaking, this process of using observational data to recover governing equations of dynamical systems falls in the domain of system identification. Once discovered, we can readily use these equations to predict future states of the system, inform control strategies for the systems, or enable theoretical investigations using analytical techniques.

Very recently, Zhang et al.(2023) proposed a promising strategy that leverages physics-informed neural networks (PINN) and symbolic regression to discover unknowns in a system of ordinary differential equations (ODEs). While their focus was on discovering differential equations for Alzheimer’s disease modeling, their proposed solution holds promise for general dynamical systems.

In this blog post, we will take a closer look at the concepts put forth by the authors and get hands-on to reproduce one of the case studies investigated in the paper. Toward that end, we will build a PINN from scratch, leverage the PySR library to perform symbolic regression, and discuss the obtained results.

If you are interested in learning best practices in physics-informed neural networks, feel free to check out my blog series here:

Physics-Informed Neural Networks: An Application-Centric Guide

Unraveling the Design Pattern of Physics-Informed Neural Networks.

With that in mind, let’s get started!

Table of Content

· 1. Case Study · 2. Why do traditional approaches fall short? · 3. PINN for System Identification (Theory) · 4. PINN for System Identification (Code) ∘ 4.1 Define the Architecture ∘ 4.2 Define ODE loss ∘ 4.3 Define gradient descent step ∘ 4.4 Data preparation ∘ 4.5 PINN Training · 5. Symbolic Regression ∘ 5.1 PySR library ∘ 5.2 Implementation ∘ 5.3 Identification results · 6. Take-away · Reference

1. Case Study

Let’s start by introducing the problem we aim to solve. In this blog, we will reproduce the first case study investigated in Zhang et al.’s original paper, i.e., discovering the Kraichnan-Orszag system from data. The system is described by the following ODEs:

with an initial condition of _u_₁(0)=1, _u_₂(0)=0.8, _u_₃(0)=0.5. The Kraichnan-Orszag system is commonly used in turbulence studies and fluid dynamics research, where the goal is to develop theoretical insights into turbulence, its structures, and its dynamics.

To mimic a typical system identification setup, we assume we only know partially about the governing ODEs. Specifically, we assume that we don’t know anything about the differential equations for _u_₁ and _u_₂. In addition, we assume we only know that the right-hand side of the differential equation for _u_₃ is a linear transformation of _u_₁ and _u_₂. Then, we can rewrite the ODE system as follows:

where _f_₁ and _f_₂ denote the unknown functions, and a and b are the unknown parameters. Our objective is to calibrate the values of a and b, as well as estimate the analytical functional form of _f_₁ and _f_₂. Essentially, we are dealing with a challenging system identification problem where both unknown parameters and function forms exist.

2. Why do traditional approaches fall short?

In the traditional paradigm of system identification, we typically employ numerical methods (e.g., Euler’s method, Runge-Kutta methods, etc.) to simulate and predict system states _u_₁, _u_₂, and _u_₃. However, those methods are fundamentally limited in that they generally require a complete form of governing differential equations, and are incapable of handling scenarios when the differential equations are only partially known.

In cases where the parameters of the equations are unknown, traditional methods often resort to optimization techniques, where an initial guess for the parameters is made, and then refined in an iterative process to minimize the difference between the observed data and the data predicted by the numerical solver. Since each optimization iteration necessitates one run of the numerical solver, this approach, while feasible, can be computationally very expensive.

Note that the above discussion only describes the case of calibrating the unknown parameters. The problem becomes even more complex when we need to estimate unknown functions in differential equations. Theoretically, we can adopt a similar methodology, i.e., making assumptions about the forms of the unknown functions before optimization. However, issues would immediately rise if we go down this path: If we assume an overly simple form, we run into the risk of underfitting, which may lead to substantial prediction errors. On the other hand, if we assume an overly complex form (e.g., with many tunable parameters), we run into the risk of overfitting, which may lead to poor generalization performance.

In summary, the traditional approach faces significant challenges when dealing with partially known differential equations:

1️⃣ Traditional numerical methods rely on having a complete form of governing differential equations to run simulations.

2️⃣ Combining traditional numerical methods with optimization algorithms can address parameter estimation problems, but often at a high computational cost.

3️⃣ For estimating unknown functions embedded in differential equations, traditional approaches may yield results that are highly sensitive to the assumed functional form, which creates risks of underfitting or overfitting.

Given these challenges, traditional approaches often fall short in addressing system identification problems where unknown parameters and functional forms coexist. This naturally leads us to the topic of physics-informed neural networks (PINNs). In the next section, we will see how PINN can effectively address the challenges faced by traditional approaches.

3. PINN for System Identification (Theory)

The physics-informed neural network (or PINN in short) is a powerful concept proposed by Raissi et al. back in 2019. The basic idea of PINN, like other physics-informed machine learning techniques, is to create a hybrid model where both the observational data and the known physical knowledge (represented as differential equations) are leveraged in model training. PINN was originally designed as an efficient ODE/PDE solver. However, researchers soon recognized that PINNs have (arguably) even greater potential in tackling inverse, system identification problems.

In the following, we will explain how PINNs can be leveraged to overcome the challenges we discussed in the previous section, one by one.

1️⃣ Traditional numerical methods rely on having a complete form of governing differential equations to run simulations.

📣 PINN’s response: Unlike traditional methods, I am capable of working with partially known differential equations, thus not confined by a complete equation to run simulations.

From an exterior perspective, PINN just resembles a conventional neural network model that takes the temporal/spatial coordinates (e.g., t, x, y) as input and outputs the target quantities (e.g., velocity u, pressure p, temperature T, etc.) we are trying to simulate. However, what sets PINNs apart from conventional NNs is that in PINN, the differential equations are used as constraints during the training process. Specifically, PINN introduces an extra loss term that accounts for the residuals of the governing differential equations, which is calculated by supplying the predicted quantities into the governing equations. By optimizing this loss term, we effectively make the trained network aware of the underlying physics.

Figure 2. Physics-informed neural networks incorporate differential equations into the loss function, therefore effectively making the trained network aware of the underlying physics. (Image by this blog author)

Since the differential equations are solely used in constructing the loss function, they have no impact on the PINN model architecture. This essentially means that we do not need to have complete knowledge of the differential equations for training. Even if we only know part of the equation, this knowledge can still be incorporated to enforce the output to obey the known physics. This flexibility of accommodating varying degrees of knowledge completeness presents a significant advantage over traditional numerical approaches.

2️⃣ Combining traditional numerical methods with optimization algorithms can address parameter estimation problems, but often at a high computational cost.

📣 PINN’s response: I can provide a computationally efficient alternative for estimating unknown parameters.

Unlike traditional approaches that treat parameter estimation as a separate optimization task, PINNs seamlessly integrate this process into the model training stage. Specifically, in PINNs, the unknown parameters are simply treated as additional trainable parameters, which are optimized along with the other neural network weights and biases during training.

Figure 3. Unknown parameters are optimized jointly with the weights and biases of PINN. At the end of the training, the final values of a and b we obtained constitute the estimates of the unknown parameters. (Image by this blog author)

In addition, PINNs fully leverage the modern deep learning framework to perform training. This allows for rapid computation of the gradients (i.e., via automatic differentiation) needed for advanced optimization algorithms (e.g., Adam), therefore greatly accelerating the parameter estimation process, especially for problems with a high-dimensional parameter space. All these factors make PINNs a competitive alternative for parameter estimation problems.

📣 PINN’s response: The unknown functions can be effectively parameterized by additional neural networks, which can be trained jointly with me, just like the previous parameter estimation scenario.

Instead of assuming the form of unknown functions, we can approximate the unknown functions with separate neural networks, and later integrate them into the main PINN model. Just as in the previous parameter estimation scenario, here, we can view those extra neural nets as an extensive set of unknown parameters to be estimated.

Figure 4. The unknown functions can be parameterized by a separate neural network and trained jointly with the original PINN. The ODE/PDE residual loss term regularizes the auxiliary neural network such that the governing equations are satisfied. In this way, the auxiliary neural network can automatically learn the optimal functional forms directly from the data. (Image by this blog author)

During training, the weights and biases of those auxiliary neural nets will be trained simultaneously with the original PINN to minimize the loss function (data loss + ODE residual loss). In doing so, those auxiliary neural networks can learn the optimal functional forms directly from the data. By removing the need to make risky assumptions about the functional form, this strategy helps alleviate the problems of underfitting and overfitting.

In summary, the power of PINN lies in its ability to work with partially known differential equations and efficiently learn unknown parameters and function forms directly from the data. This versatility sets them apart from traditional approaches, therefore making them an effective tool for system identification tasks.

In the next section, we will start working on our case study and turn theory into actual code.

4. PINN for System Identification (Code)

In this section, we will implement a PINN (in TensorFlow) to address our target case study. Let’s start by importing the necessary libraries:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy.integrate import solve_ivp

import tensorflow as tf
from tensorflow import keras
tf.random.set_seed(42)

4.1 Define the Architecture

For the main PINN, we use one neural network to predict u, which has 1-dimensional input (i.e., t) and 3-dimensional output (_u_₁, _u_₂, and _u_₃). In addition, as discussed in the previous section, we use an auxiliary neural network to approximate the unknown functions _f_₁ and _f_₂, which has 4-dimensional input (i.e., t, _u_₁, _u_₂, and _u_₃) and 2-dimensional output (_f_₁ and _f_₂). The architecture of the overall PINN is shown below:

Figure 5. The architecture of the employed PINN model.(Image by this blog author)

One thing worth emphasizing again is that it is necessary to feed the auxiliary neural network with all the available features (in our current case, t, _u_₁, _u_₂, and _u_₃), as we do not know the precise functional forms of _f_₁ and _f_₂. During training, the auxiliary neural network will automatically determine which features are necessary/important in a data-driven manner.

First, let’s define the neural network that predicts u. Here, we use two hidden layers, each of which is equipped with 50 neurons and hyperbolic tangent activation functions:

def u_net(u_input):
    """Definition of the network for u prediction.

    Args:
    ----
    u_input: input for the u-net

    Outputs:
    --------
    output: the output of u-net
    """

    hidden = u_input
    for _ in range(2):
        hidden = tf.keras.layers.Dense(50, activation="tanh")(hidden)
    output = tf.keras.layers.Dense(3)(hidden)

    return output

Next, we define the auxiliary neural network that predicts f. We adopt the same network architecture:

def f_net(f_inputs, a_init=None, b_init=None):
    """Definition of the network for f prediction.

    Args:
    ----
    f_inputs: list of inputs for the f-net
    a_init: initial value for parameter a
    b_init: initial value for parameter b

    Outputs:
    --------
    output: the output of f-net
    """

    hidden = tf.keras.layers.Concatenate()(f_inputs)
    for _ in range(2):
        hidden = tf.keras.layers.Dense(50, activation="tanh")(hidden)
    output = tf.keras.layers.Dense(2)(hidden)
    output = ParameterLayer(a_init, b_init)(output)

    return output

In the code above, we add a and b to the collection of the neural network model parameters. This way, a and b can be optimized jointly with the other weights and biases of the neural network. We achieved this goal by defining a custom layer ParameterLayer:

class ParameterLayer(tf.keras.layers.Layer):

    def __init__(self, a, b, trainable=True):
        super(ParameterLayer, self).__init__()
        self._a = tf.convert_to_tensor(a, dtype=tf.float32)
        self._b = tf.convert_to_tensor(b, dtype=tf.float32)
        self.trainable = trainable

    def build(self, input_shape):
        self.a = self.add_weight("a", shape=(1,), 
                                 initializer=tf.keras.initializers.Constant(value=self._a),
                                 trainable=self.trainable)
        self.b = self.add_weight("b", shape=(1,), 
                                 initializer=tf.keras.initializers.Constant(value=self._b),
                                 trainable=self.trainable)

    def get_config(self):
        return super().get_config()

    @classmethod
    def from_config(cls, config):
        return cls(**config)

Note that this layer does nothing besides introducing the two parameters as the model attributes.

Finally, we put u-net and f-net together and define the architecture for the full PINN:

def create_PINN(a_init=None, b_init=None, verbose=False):
    """Definition of a PINN.

    Args:
    ----
    a_init: initial value for parameter a
    b_init: initial value for parameter b
    verbose: boolean, indicate whether to show the model summary

    Outputs:
    --------
    model: the PINN model
    """

    # Input
    t_input = tf.keras.Input(shape=(1,), name="time")

    # u-NN
    u = u_net(t_input)

    # f-NN
    f = f_net([t_input, u], a_init, b_init)

    # PINN model
    model = tf.keras.models.Model(inputs=t_input, outputs=[u, f])

    if verbose:
        model.summary()

    return model

In the code above, we concatenate the input t and the u-net outputs _u_₁, _u_₂, and _u_₃ before feeding them into the f-net. Also, we output both u and f in the overall PINN model. Although only u is needed in practice (as u is our modeling target), the prediction of f will become useful later for distilling its analytical functional forms (see section 5).

4.2 Define ODE loss

Next, we define the function to compute the ODE residual loss. Recall that our target ODEs are:

Therefore, we can define the function as follows:

@tf.function
def ODE_residual_calculator(t, model):
    """ODE residual calculation.

    Args:
    ----
    t: temporal coordinate
    model: PINN model

    Outputs:
    --------
    ODE_residual: residual of the governing ODE
    """

    # Retrieve parameters
    a = model.layers[-1].a
    b = model.layers[-1].b

    with tf.GradientTape() as tape:
        tape.watch(t)
        u, f = model(t)

    # Calculate gradients
    dudt = tape.batch_jacobian(u, t)[:, :, 0]
    du1_dt, du2_dt, du3_dt = dudt[:, :1], dudt[:, 1:2], dudt[:, 2:]

    # Compute residuals
    res1 = du1_dt - f[:, :1]
    res2 = du2_dt - f[:, 1:]
    res3 = du3_dt - (a*u[:, :1]*u[:, 1:2] + b)
    ODE_residual = tf.concat([res1, res2, res3], axis=1)

    return ODE_residual

Although the code above is mostly self-explanatory, several things are worth mentioning:

We used tf.GradientTape.batch_jacobian() (instead of the usual GradientTape.gradient()) to calculate the gradient of _u_₁, _u_₂, and _u_₃ w.r.t t. GradientTape.gradient() won’t work here as it computes the sum d_u_₁/dt + d_u_₂/dt + d_u_₃/dt. Potentially we could also use GradientTape.jacobian() here to compute the gradient of each output value w.r.t each input value. For more details, please refer to the official page.
We used @tf.function decorator to convert the above Python function into a TensorFlow graph. It is useful to do that as gradient calculation can be quite expensive and executing it in Graph mode can significantly accelerate the computations.

4.3 Define gradient descent step

Next, we configure the logic for calculating the gradients of total loss with respect to the parameters (network weights and biases, as well as the unknown parameters a and b). This is necessary for performing the gradient descent for model training:

@tf.function
def train_step(X_ODE, X, y, IC_weight, ODE_weight, data_weight, model):
    """Calculate gradients of the total loss with respect to network model parameters.

    Args:
    ----
    X_ODE: collocation points for evaluating ODE residuals
    X: observed samples
    y: target values of the observed samples
    IC_weight: weight for initial condition loss
    ODE_weight: weight for ODE loss
    data_weight: weight for data loss
    model: PINN model

    Outputs:
    --------
    ODE_loss: calculated ODE loss
    IC_loss: calculated initial condition loss
    data_loss: calculated data loss
    total_loss: weighted sum of ODE loss, initial condition loss, and data loss
    gradients: gradients of the total loss with respect to network model parameters.
    """

    with tf.GradientTape() as tape:
        tape.watch(model.trainable_weights)

        # Initial condition prediction
        y_pred_IC, _ = model(tf.zeros((1, 1)))

        # ODE residual
        ODE_residual = ODE_residual_calculator(t=X_ODE, model=model)

        # Data loss
        y_pred_data, _ = model(X)

        # Calculate loss
        IC_loss = tf.reduce_mean(keras.losses.mean_squared_error(tf.constant([[1.0, 0.8, 0.5]]), y_pred_IC))
        ODE_loss = tf.reduce_mean(tf.square(ODE_residual))
        data_loss = tf.reduce_mean(keras.losses.mean_squared_error(y, y_pred_data))

        # Weight loss
        total_loss = IC_loss*IC_weight + ODE_loss*ODE_weight + data_loss*data_weight

    gradients = tape.gradient(total_loss, model.trainable_variables)

    return ODE_loss, IC_loss, data_loss, total_loss, gradients

In the code above:

We consider three loss terms: the initial condition loss IC_loss, the ODE residuals lossODE_loss, and the data loss data_loss. The IC_loss is calculated by comparing the model-predicted u(t=0) with the known initial value of u, **** the ODE_loss is calculated by calling our previously defined ODE_residual_calculator function, and the data loss is calculated by simply comparing the model predictions (i.e., _ u₁, u₂,_ u₃) with their observed values.
We define the total loss as a weighted sum of IC_loss,ODE_loss, anddata_loss. Generally, the weights control how much emphasis is given to the individual loss terms during the training process. In our case study, it is sufficient to set all of them as 1.

4.4 Data preparation

In this subsection, we discuss how to organize data for PINN model training.

Recall that our total loss function contains both ODE residual loss and data loss. Therefore, we need to generate both collocation points in the time dimension (for evaluating ODE loss) and the paired input(t)-output(u) supervised data.

# Set batch size
data_batch_size = 100
ODE_batch_size = 1000

# Samples for enforcing ODE residual loss
N_collocation = 10000
X_train_ODE = tf.convert_to_tensor(np.linspace(0, 10, N_collocation).reshape(-1, 1), dtype=tf.float32)
train_ds_ODE = tf.data.Dataset.from_tensor_slices((X_train_ODE))
train_ds_ODE = train_ds_ODE.shuffle(10*N_collocation).batch(ODE_batch_size)

# Samples for enforcing data loss
X_train_data = tf.convert_to_tensor(u_obs[:, :1], dtype=tf.float32)
y_train_data = tf.convert_to_tensor(u_obs[:, 1:], dtype=tf.float32)
train_ds_data = tf.data.Dataset.from_tensor_slices((X_train_data, y_train_data))
train_ds_data = train_ds_data.shuffle(10000).batch(data_batch_size)

In the code above, we allocated 10000 equally-spaced collocation points within our target time domain [0, 10]. For facilitating data loss computation, we pre-generated the paired input(t)-output(u) dataset u_obs, with its first column being the time coordinates, and the remaining three columns representing _u_₁, _u_₂, and _u_₃, respectively. u_obs contains 1000 data points and is calculated with the following code:

# Set up simulation
u_init = [1, 0.8, 0.5]
t_span = [0, 10]
obs_num = 1000

# Solve ODEs
u_obs = simulate_ODEs(u_init, t_span, obs_num)

where simulate_ODEs is the ODE solver that simulates u-trajectory given the initial conditions and simulation domain:

def simulate_ODEs(u_init, t_span, obs_num):
    """Simulate the ODE system and obtain observational data. 

    Args:
    ----
    u_init: list of initial condition for u1, u2, and u3
    t_span: lower and upper time limit for simulation
    obs_num: number of observational data points

    Outputs:
    --------
    u_obs: observed data for u's
    """

    # Target ODEs
    def odes(t, u):
        du1dt = np.exp(-t/10) * u[1] * u[2]
        du2dt = u[0] * u[2]
        du3dt = -2 * u[0] * u[1]
        return [du1dt, du2dt, du3dt]

    # Solve ODEs
    t_eval = np.linspace(t_span[0], t_span[1], obs_num)
    sol = solve_ivp(odes, t_span, u_init, method='RK45', t_eval=t_eval)

    # Restrcture solution
    u_obs = np.column_stack((sol.t, sol.y[0], sol.y[1], sol.y[2]))

    return u_obs

The following figure shows the target u profiles. Note that we have sampled 1000 equally-spaced (t – _u_₁), (t – _u_₂), and (t – _u_₃) data pairs (contained in u_obs) as the supervised data for data loss calculation.

Figure 6. Output profiles of our currently investigated ODEs. (Image by this blog author)

4.5 PINN Training

The following code defines the main training and validation logic:

# Set up training configurations
n_epochs = 1000
IC_weight= tf.constant(1.0, dtype=tf.float32)   
ODE_weight= tf.constant(1.0, dtype=tf.float32)
data_weight= tf.constant(1.0, dtype=tf.float32)
a_list, b_list = [], []

# Initial value for unknown parameters
a_init, b_init = -1, 1

# Set up optimizer
optimizer = keras.optimizers.Adam(learning_rate=2e-2)

# Instantiate the PINN model
PINN = create_PINN(a_init=a_init, b_init=b_init)
PINN.compile(optimizer=optimizer)

# Configure callbacks
_callbacks = [keras.callbacks.ReduceLROnPlateau(factor=0.5, patience=100),
             tf.keras.callbacks.ModelCheckpoint('PINN_model.h5', monitor='val_loss', save_best_only=True)]
callbacks = tf.keras.callbacks.CallbackList(
                _callbacks, add_history=False, model=PINN)

# Start training process
for epoch in range(1, n_epochs + 1):  
    print(f"Epoch {epoch}:")

    for (X_ODE), (X, y) in zip(train_ds_ODE, train_ds_data):

        # Calculate gradients
        ODE_loss, IC_loss, data_loss, total_loss, gradients = train_step(X_ODE, X, y, IC_weight, 
                                                                         ODE_weight, data_weight, PINN)
        # Gradient descent
        PINN.optimizer.apply_gradients(zip(gradients, PINN.trainable_variables))

    # Parameter recording
    a_list.append(PINN.layers[-1].a.numpy())
    b_list.append(PINN.layers[-1].b.numpy())

    ####### Validation
    val_res = ODE_residual_calculator(tf.reshape(tf.linspace(0.0, 10.0, 1000), [-1, 1]), PINN)
    val_ODE = tf.cast(tf.reduce_mean(tf.square(val_res)), tf.float32)

    u_init=tf.constant([[1.0, 0.8, 0.5]])
    val_pred_init, _ = PINN.predict(tf.zeros((1, 1)))
    val_IC = tf.reduce_mean(tf.square(val_pred_init-u_init))

    # Callback at the end of epoch
    callbacks.on_epoch_end(epoch, logs={'val_loss': val_IC+val_ODE})

    # Re-shuffle dataset
    train_ds_data = tf.data.Dataset.from_tensor_slices((X_train_data, y_train_data))
    train_ds_data = train_ds_data.shuffle(10000).batch(data_batch_size) 

    train_ds_ODE = tf.data.Dataset.from_tensor_slices((X_train_ODE))
    train_ds_ODE = train_ds_ODE.shuffle(10*N_collocation).batch(ODE_batch_size)

As discussed previously, we set the weights for different loss components as 1.
We set the initial guess for a and b as -1 and 1, respectively. Recall that these values are different from their true values, which are -2 and 0, respectively.
For validation, we add up the ODE residual loss and initial condition loss to serve as the final validation loss. Note that we do not consider data loss here as we assume we have no access to additional paired t – u datasets for validation purposes. The computed validation loss is used to adapt the learning rate.

The following figure displays the loss convergence curve. We can see that all three loss components converged properly, indicating that the training is done satisfactorily.

Figure 7. Loss convergence plot. (Image by this blog author)

The following figure shows the comparison between the predicted u‘s and the ground truth calculated by the ODE solver. Here, we can also see that the PINN is able to accurately solve our target ODEs.

Figure 8. Comparison of predicted u's and the ground truth computed by ODE solver. — Figure 8. Comparison of predicted u‘s and the ground truth computed by ODE solver.

Nevertheless, training the PINN is not our end goal. Instead, we are more interested in estimating the unknowns embedded in our target ODEs. Let’s start with the parameter estimation. The following figure depicts the evolution of a and b.

Figure 9. The unknown parameters a and b quickly moved away from the specified initial values and converged to their true values. This demonstrates that the adopted PINN strategy is capable of performing parameter estimation for ODE systems. (Image by this blog author)

We can clearly see that as the training proceeds, the values of a and b quickly converge to their respective true values. This indicated the effectiveness of our PINN strategy for parameter estimation.

In addition to the unknown parameters, we have also obtained the estimates of unknown functions _f_₁ and _f_₂, thanks to the trained auxiliary f-net. To examine the approximation accuracy of the _f_₁ and _f_₂, we can compare them to the calculated derivative of d_u_₁/dt and d_u_₂/dt, as shown in the code below:

X_test = np.linspace(0, 10, 1000).reshape(-1, 1)
X_test = tf.convert_to_tensor(X_test, dtype=tf.float32)

with tf.GradientTape() as tape:
    tape.watch(X_test)
    u, f = PINN(X_test)

# Calculate gradients
dudt = tape.batch_jacobian(u, X_test)[:, :, 0]
du1_dt, du2_dt = dudt[:, :1], dudt[:, 1:2]

# Visualize comparison
fig, ax = plt.subplots(1, 2, figsize=(10, 4))

ax[0].scatter(du1_dt.numpy().flatten(), f[:, 0].numpy())
ax[0].set_xlabel('$du_1$/dt', fontsize=14)
ax[0].set_ylabel('$f_1$', fontsize=14)

ax[1].scatter(du2_dt.numpy().flatten(), f[:, 1].numpy())
ax[1].set_xlabel('$du_2$/dt', fontsize=14)
ax[1].set_ylabel('$f_2$', fontsize=14)

for axs in ax:
    axs.tick_params(axis='both', which='major', labelsize=12)
    axs.grid(True)

plt.tight_layout()

We can see in the following figure that the f-net predictions fully fulfill the governing ODEs, which is in agreement with the previous observations that the ODE residuals are very small.

Figure 10. Comparison between the calculated derivatives and predicted f function values.

Although we can accurately approximate the unknown functions _f_₁ and _f_₂ with an f-net, at the end of the day, f-net is a black-box neural network model. Naturally, we would like to ask: what is the exact functional form of these estimated functions? The answer could provide us with a deeper understanding of the underlying physical process, and help us generalize the results to other similar problems.

So, how can we extract these precise functional forms from our trained neural network model? We will look into that in the next section.

5. Symbolic Regression

Symbolic regression is a powerful supervised machine learning technique that can be used to discover the underlying mathematical formula that best fits a given dataset. This technique, as the name suggests, comprises two key components: symbolic and regression:

Symbolic refers to the use of symbolic expressions to model the input-output relationship, e.g., "+" for addition, "-" for subtraction, "cos" for cosine function, etc. Instead of fitting a predefined model (e.g., polynomial model, etc.), symbolic regression methods search through an entire space of potential symbolic expressions to find the best fit.
Regression refers to the process of creating a model to predict an output variable based on the input variables, thus capturing the underlying relationship between them. Although the term "regression" may invoke thoughts of linear regression, in the context of symbolic regression, it is not restricted to any specific model forms but can take a wide array of mathematical operators and structures.

In this section, we will implement the symbolic regression technique to distill the learned f-net into interpretable and compact mathematical expressions, which aligns with the strategy proposed by Zhang et al. in their original paper. We will begin by introducing the library PySR, which we will use for symbolic regression. Subsequently, we will apply this library to our problem and discuss the choice of hyperparameters. Finally, we analyze the obtained results.

5.1 PySR library

PySR is an open-source Python library designed for practical, high-performance, scientific symbolic regression. It uses advanced evolutionary optimization algorithms to search through the space of simple analytic expressions for accurate and interpretable models, such that the prediction error and model complexity are jointly minimized.

Although PySR exposes a simple Python frontend API that resembles the style of scikit-learn, its backend is written in pure-Julia under the library named SymbolicRegression.jl. This gives the user the flexibility of customizing operators and optimization loss functions while enjoying high computation performance. For more details on the working principles of PySR, please refer to this paper.

To get started with PySR, you would need to install Julia first. Afterward, run

pip3 install -U pysr

Then install Julia dependencies via

python3 -m pysr install

or from within IPython, call

import pysr
pysr.install()

PySR can also be installed via conda or docker. Please check the installation page for more details.

5.2 Implementation

Next, we apply the PySR library to distill the learned f-net into interpretable and compact mathematical expressions. To begin with, we need to generate the dataset for symbolic regression learning:

t = np.linspace(0, 10, 10000).reshape(-1, 1)
u, f = PINN.predict(t, batch_size=12800)

# Configure dataframe
df = pd.DataFrame({
    't': t.flatten(),
    'u1': u[:, 0],
    'u2': u[:, 1],
    'u3': u[:, 2],
    'f1': f[:, 0],
    'f2': f[:, 1]
})
df.to_csv('f_NN_IO.csv', index=False)

Note that for our current problem, the inputs for the symbolic regression learning are t, _u_₁, _u_₂, and _u_₃, and the outputs are _f_₁ and _f_₂. This is because, in our target ODEs, we assume _f_₁=_f_₁(t, _u_₁, _u_₂, _u_₃) and _f_₂=_f_₂(t, _u_₁, _u_₂, _u_₃). We saved the generated dataframe (see figure below) for later usage.

Figure 11. The generated dataframe for symbolic regression learning. (Image by this blog author)

After generating the dataset, we are ready to perform symbolic regression with PySR. Note that it is recommended to run the PySR code in terminals instead of in Jupyter Notebook. Although PySR provides support for Jupyter Notebook, the printing (e.g., search progress, current best results, etc.) is much nicer in a terminal environment.

Following the scikit-learn style, we start by defining a model object:

from pysr import PySRRegressor

model = PySRRegressor(
    niterations=20,  
    binary_operators=["+", "*"],
    unary_operators=[
        "cos",
        "exp",
        "sin",
        "inv(x) = 1/x",
    ],
    extra_sympy_mappings={"inv": lambda x: 1 / x},
    loss="L1DistLoss()",
    model_selection="score",
    complexity_of_operators={
        "sin": 3, "cos": 3, "exp": 3,
        "inv(x) = 1/x": 3
    }
)

Here is the break-down of the specified hyperparameters:

niterations: Number of iterations of the algorithm to run. Generally, a larger iteration number yields better results, at the cost of higher computational cost. However, since PySR allows terminating the search job early, a good practice is to simply set niterations to some very large value and keep the optimization going. Once the identified equation looks satisfactory, the job can be stopped early.
binary_operators: List of strings for binary operators used in the search. The built-in binary operators supported by PySR include +, -, *, /, ^, greater, mod, logical_or, logical_and.
unary_operators: List of unary operators used in the search. Note that unary operators only take a single scalar as input. The built-in ones include neg, square, cube, exp, abs, log, log10, log2, log1p, sqrt, sin, cos, tan, sinh, cosh, tanh, atan, asinh, acosh, atanh_clip (=atanh((x+1)%2 – 1)), erf, erfc, gamma, relu, round, floor, ceil, round, sign. Note that to supply a custom operator, we need to pass in "myfunction(x) = …" to the operator list, like what we did with "inv(x) = 1/x".
extra_[sympy](https://www.sympy.org/en/index.html)_mappings: Provides mappings between custom binary_operators or unary_operators defined in julia strings, to those same operators defined in sympy. This is useful when exporting the results.
loss: String of Julia code specifying an elementwise loss function (as defined in LossFunctions.jl). Commonly used losses include L1DistLoss()(the absolute distance loss), L2DistLoss()(the least squares loss), HuberLoss()(the Huber loss function used for robustness to outliers). The loss function specifies the optimization target for the symbolic regression search.
model_selection: Model selection criterion when selecting a final expression from the list of best expressions at each complexity. score means that the candidate model will be selected based on the highest score, which is defined as -Δlog(loss)/ΔC, where C refers to the complexity of the expression and Δ denotes local change. Therefore, if an expression has a much better loss at a slightly higher complexity, it is preferred.
complexity_of_operators: By default, all operators have a complexity of 1. To change the default complexity setting and give preference to different operators, we can supply a dictionary with the key being the operator string and the value being its corresponding complexity level. In our current case, we set all unary operators to have a complexity level of 3, which was also adopted in the original paper of Zhang et al.

It is worth mentioning that PySRRegressor exposes many other hyperparameters for setting up the algorithm, data preprocessing, stopping criteria, performance and parallelization, monitoring, environment, and results exporting. For the complete list of options for controlling the symbolic regression search, please check out the PySRRegressor Reference page.

5.3 Identification results

After specifying the model object, we can kick off the fitting process with three lines of code (for distilling the analytical forms of _f_₁):

df = pd.read_csv('f_NN_IO.csv')
X = df.iloc[:, :4].to_numpy()
f1 = df.loc[:, 'f1'].to_numpy()

model.fit(X, f1)

While the script is running, you should be able to see the progress bar and the current best equations, as shown in the figure below. Note that x0, x1, x2, and x3 correspond to t, _u_₁, _u_₂, and _u_₃, respectively.

Once the optimization job is finished, a list of candidate equations will appear in the terminal:

If we rank the equations based on their score values, we can see the top-3 equations are:

_u_₂ _u_₃ exp( -0.1053391 t )
0.60341805 _u_₂ _u_₃
_u_₂ _u_₃

Recall that our true ODE is

It is quite impressive to see that the PySR has accurately identified the essential inputs (i.e., it recognized that _u_₁ does not play a role in _f_₁) and discovered an analytical expression (top-1 result) that is fairly close to the true expression of _f_₁.

We replicate the same analysis to _f_₂. The optimization results are shown in the figure below:

This time, we notice that the true expression of _f_₂, i.e., _f_₂=_u_₁_u_₃, only appears as the second-best (in terms of score) equation. However, note that the best one, i.e., _u_₃, has a score that is only marginally higher than the second-best one. On the other hand, the loss value of _u_₁_u_₃ is one magnitude lower than using _u_₃ alone. These observations indicate that in practice, we would need domain knowledge/experience to make an informed decision regarding whether the incurred complexity for high accuracy is worth pursuing.

6. Takeaways

In this blog post, we investigated the problem of discovering differential equations from observational data. We followed the strategy proposed by Zhang et al., implemented it in code, and applied it to address a case study. Here are the key take-aways:

1️⃣ Physics-informed neural network (PINN) is a versatile tool for performing system identifications, particularly in scenarios where only partial information is known about the governing differential equations. By assimilating observational data and available physical knowledge, PINN can effectively estimate not only the unknown parameters but also unknown functions, if we adopt the trick of parameterizing the unknown functions with auxiliary neural networks, which can be jointly trained with the main PINN. All these factors contribute to a substantial advantage over traditional system identification methods.

2️⃣ Symbolic regression is a powerful tool in opening the black box of the learned neural networks. By searching through an entire space of symbolic expressions with advanced evolutionary algorithms, symbolic regression is able to extract interpretable and compact analytical expressions that can accurately describe the hidden input-output relationship. This knowledge-distillation process is greatly appreciated in practice as it can effectively enhance our understanding of the underlying system dynamics.

Before we conclude this blog, there are a couple of points worth considering when applying PINN+symbolic regression for practical problems:

1️⃣ Uncertainty quantification (UQ)

Throughout this blog, we have operated under the assumption that our observed data for _u_₁, _u_₂, and _u_₃ is noise-free. However, this assumption is generally not true, as the observational data can easily be contaminated by noise for practical dynamical systems. Consequently, both the accuracy and reliability of our system identification results will suffer. Therefore, a crucial aspect to consider is the quantification of uncertainty within our system identification workflow. Techniques such as Bayesian Neural Networks and Monte Carlo simulation could properly account for noise in the observed data and provide an estimation of the confidence interval for the predictions.

2️⃣ Sensitivity of symbolic regression

Generally speaking, the results yielded by symbolic regression may be sensitive to the employed loss function, supplied candidates of unary and binary operators, as well as the defined complexities of the operators. For example, in my attempt to reproduce the results published by Zhang et al., I was unable to obtain the exact top-3 equations for _f_₂ as shown in the original paper, although I have adopted the exact same settings (to my best knowledge). Several factors may contribute to this mismatch: first of all, the evolutionary optimization techniques are intrinsically stochastic, therefore results can vary across different runs. Secondly, it is likely that the PINN trained in the first stage is different, therefore the resultant dataset (i.e., t, _u_₁, _u_₂, _u_₃ → _f_₁, _f_₂) is also different, which in turn impacts the symbolic regression outcome.

Overall, these observations suggested that symbolic regression outcomes should not be accepted blindly. Instead, it’s crucial to rely on the domain understanding/knowledge to critically assess the plausibility of the identified equations.

If you find my content useful, you could buy me a coffee here 🤗 Thank you very much for your support!

You can find the companion notebook and script with full code here 💻

To learn the best practices of physics-informed neural networks: Unraveling the Design Pattern of Physics-Informed Neural Networks

To learn more about physics-informed operator learning: Operator Learning via Physics-Informed DeepONet

Feel free to subscribe to my newsletter or follow me on Medium.

Reference

[1] Zhang et al., Discovering a reaction-diffusion model for Alzheimer’s disease by combining PINNs with symbolic regression. arXiv, 2023.

[2] Cranmer et al., Interpretable Machine Learning for Science with PySR and SymbolicRegression.jl. arXiv, 2023.