from symbolfit.symbolfit import *
Detected IPython. Loading juliacall extension. See https://juliapy.github.io/PythonCall.jl/stable/compat/#IPython
Dataset¶
Five inputs are needed, each a Python list or NumPy array (more options will be added in the future!):
- x: independent variable (bin center locations).
- y: dependent variable.
- y_up: upward uncertainty in y per bin.
- y_down: downward uncertainty in y per bin.
- bin_widths_1d: bin widths in x.
- Elements of both y_up and y_down should be non-negative.
- These values are the "delta" in y:
- y + y_up = y shifted up by one standard deviation.
- y - y_down = y shifted down by one standard deviation.
- If the dataset has no uncertainties, set both y_up and y_down to ones with the same shape as x.
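For instance, the no-uncertainty case can be set up as follows (a minimal sketch using NumPy; the toy values are arbitrary):

```python
import numpy as np

# When the dataset carries no uncertainties, unit "uncertainties" give
# every bin equal weight in the loss.
x = [0.5, 1, 1.5, 2]
y = [3.0, 2.2, 2.1, 2.05]
y_up = np.ones(len(x))    # same shape as x, all ones
y_down = np.ones(len(x))
```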
x = [0.5, 1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5]
y = [3.0, 2.2, 2.1, 2.05, 2, 2.1, 2.2, 1.9, 1.6]
y_up = [0.4, 0.3, 0.2, 0.2, 0.1, 0.05, 0.06, 0.1, 0.1]
y_down = [0.4, 0.3, 0.2, 0.2, 0.1, 0.05, 0.06, 0.1, 0.1]
bin_widths_1d = [0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5]
Plot the dataset to see what we will be fitting to:
fig, axes = plt.subplots(figsize = (6, 4))
plt.errorbar(np.array(x).flatten(),
             np.array(y).flatten(),
             yerr = [np.array(y_down).flatten(), np.array(y_up).flatten()],
             xerr = np.array(bin_widths_1d)/2,
             fmt = '.', c = 'black', ecolor = 'grey', capsize = 0,
             )
plt.savefig('img/toy2a/dataset.png')
Configure the fit¶
Configure PySR to define the function space searched by symbolic regression:
from pysr import PySRRegressor
pysr_config = PySRRegressor(
    model_selection = 'accuracy',
    niterations = 200,
    maxsize = 80,
    binary_operators = [
        '+', '*', '/', '^'
    ],
    unary_operators = [
        'exp',
        'tanh',
    ],
    nested_constraints = {
        'exp': {'exp': 0, 'tanh': 0, '*': 2, '/': 1, '^': 1},
        'tanh': {'exp': 0, 'tanh': 0, '*': 2, '/': 1, '^': 1},
        '*': {'exp': 1, 'tanh': 1, '*': 2, '/': 1, '^': 1},
        '^': {'exp': 1, 'tanh': 1, '*': 2, '/': 1, '^': 0},
        '/': {'exp': 1, 'tanh': 1, '*': 2, '/': 0, '^': 1},
    },
    loss = 'loss(y, y_pred, weights) = (y - y_pred)^2 * weights',
)
Here, we allow four binary operators (+, *, /, pow) and two unary operators (exp, tanh) when searching for functional forms. The custom-defined gauss from the previous example is likely not needed here, since this dataset does not show an obvious peak.
Nested constraints are imposed to prohibit overly nested forms such as exp(exp(x)).
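To make the semantics concrete, here is a small illustrative check, not part of PySR, of how such a constraint rules out expressions; for this sketch, expressions are represented as nested tuples like ('exp', ('*', 'x', 'x')):

```python
def count_op(expr, op):
    """Count occurrences of `op` in an expression tree."""
    if not isinstance(expr, tuple):
        return 0
    head, *args = expr
    return (head == op) + sum(count_op(a, op) for a in args)

def violates(expr, constraints):
    """True if any operator's subtree exceeds its allowed nesting counts."""
    if not isinstance(expr, tuple):
        return False
    head, *args = expr
    limits = constraints.get(head, {})
    for op, max_count in limits.items():
        if sum(count_op(a, op) for a in args) > max_count:
            return True
    return any(violates(a, constraints) for a in args)

constraints = {'exp': {'exp': 0}}  # no exp allowed inside exp
violates(('exp', ('exp', 'x')), constraints)    # True: exp(exp(x)) is banned
violates(('exp', ('*', 'x', 'x')), constraints) # False: exp(x*x) is fine
```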
The loss function is a weighted MSE, where by default SymbolFit sets the weight from the inverse squared uncertainty.
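The loss string above is Julia code executed by the PySR backend; a NumPy version is sketched below just for intuition:

```python
import numpy as np

# Weighted MSE: each squared residual is multiplied by a per-bin weight
# before averaging, so bins with larger weights pull the fit harder.
def weighted_mse(y, y_pred, weights):
    y, y_pred, weights = map(np.asarray, (y, y_pred, weights))
    return np.mean((y - y_pred) ** 2 * weights)

weighted_mse([1.0, 2.0], [1.0, 3.0], [1.0, 0.25])  # -> 0.125
```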
For the full list of PySR options, please see the PySR documentation.
Configure SymbolFit with the PySR config and for the re-optimization process:
model = SymbolFit(
    # Dataset: x, y, y_up, y_down.
    x = x,
    y = y,
    y_up = y_up,
    y_down = y_down,
    # PySR configuration of the function space.
    pysr_config = pysr_config,
    # Constrain the maximum function size, overwriting maxsize in pysr_config.
    # Set a higher value for more complex shapes, or when a lower one does not fit well.
    max_complexity = 30,
    # Whether to scale the input x to within 0 and 1 during the fits for numerical stability,
    # as large x could lead to overflow through e.g. exp(x) -> exp(10000).
    # Set this to False when your x's are at or close to O(1); otherwise True is recommended.
    # After the fits, the functions are unscaled to reflect the original dataset.
    input_rescale = False,
    # ^ no scaling needed here since the input x is O(1).
    # Whether to scale y during the fits for numerical stability;
    # options (when input_rescale is True): None / 'mean' / 'max' / 'l2'.
    # This is useful to stabilize fits when your y's are very large or very small.
    # After the fits, the functions are unscaled to reflect the original dataset.
    scale_y_by = None,
    # ^ no scaling needed here since the input y is O(1).
    # Set a maximum standard error (%) for all parameters to avoid bad fits during re-optimization.
    # In the refit loop, when any parameter returns a standard error larger than max_stderr,
    # the fit is considered failed and is retried with fewer or different combinations of varying
    # parameters, freezing some parameters to their initial values during re-optimization.
    # This avoids bad fits when the objective is too complex to minimize, which could cause some
    # parameters to have unrealistically large standard errors.
    # In most cases 10 < max_stderr < 100 suffices.
    max_stderr = 20,
    # Use y_up and y_down to weight the MSE loss during the SR search and re-optimization.
    fit_y_unc = True,
    # Set a random seed to return the same batch of functional forms every time (single-threaded);
    # otherwise set None to explore more functions each run (multi-threaded and faster).
    # The function space is usually huge: retrying the fit with the exact same configuration
    # but a different random seed can yield a completely different set of candidate functions.
    # So if the candidates are not satisfactory this time, rerun a few more times with
    # random_seed = None or a different seed each time.
    random_seed = None,
    # Custom loss weights to set "(y - y_pred)^2 * loss_weights", overriding the weights from y_up and y_down.
    loss_weights = None
)
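For intuition, input_rescale corresponds to a min-max style scaling (the exact internal implementation may differ); a minimal sketch:

```python
import numpy as np

# Min-max scaling maps x into [0, 1], which keeps terms like exp(x)
# numerically tame during the search; after the fits, the functions
# are expressed back in terms of the original x.
x = np.array([0.5, 1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5])
x_scaled = (x - x.min()) / (x.max() - x.min())
```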
Symbol Fit it!¶
Run the fits: SR fit to search for functional forms -> parameterization -> re-optimization fit for improved best-fits and uncertainty estimation -> evaluation.
model.fit()
Compiling Julia backend...
[ Info: Started!
Expressions evaluated per second: 7.180e+05
Head worker occupation: 16.4%
Progress: 1457 / 3000 total iterations (48.567%)
====================================================================================================
Hall of Fame:
---------------------------------------------------------------------------------------------------
Complexity Loss Score Equation
1 1.752e-01 1.594e+01 y = 1.68
2 3.437e-02 1.629e+00 y = exp(0.72041)
5 2.267e-02 1.387e-01 y = (-0.15192 * x₀) + 2.5457
6 1.478e-02 4.280e-01 y = 2.2524 + (exp(x₀) * -0.0061958)
7 1.043e-02 3.484e-01 y = (-0.00062681 * (x₀ ^ x₀)) + 2.1433
8 1.021e-02 2.138e-02 y = (-5.2253e-07 * (exp(x₀) ^ 3.0764)) + 2.1265
10 1.016e-02 2.503e-03 y = tanh(-2.9879e-05 * ((x₀ + x₀) ^ x₀)) + 2.1267
11 7.875e-03 2.548e-01 y = (2.1336 + (-0.00061524 * (x₀ ^ x₀))) + (0.13061 ^ x₀)
12 7.452e-03 5.516e-02 y = (2.1165 + (-5.0177e-07 * (exp(x₀) ^ 3.0805))) + (0.15228 ^...
x₀)
13 6.008e-03 2.153e-01 y = (2.1336 + (-0.00061524 * (x₀ ^ x₀))) + ((0.013848 ^ x₀) / ...
0.13724)
14 5.845e-03 2.759e-02 y = ((-5.0177e-07 * (exp(x₀) ^ 3.0805)) + 2.1165) + ((0.14706 ...
^ x₀) / x₀)
15 5.702e-03 2.468e-02 y = ((0.47413 / x₀) ^ 3.1654) + ((-3.4388e-07 * (exp(x₀) ^ 3.1...
654)) + exp(0.74782))
16 5.567e-03 2.404e-02 y = ((0.45735 / x₀) ^ exp(x₀)) + ((-3.445e-07 * (exp(x₀) ^ 3.1...
654)) + exp(0.75035))
17 5.551e-03 2.948e-03 y = ((0.45735 / x₀) ^ (3.172 * x₀)) + (((exp(x₀) ^ 3.172) * -3...
.3335e-07) + exp(0.75035))
18 5.549e-03 3.421e-04 y = ((0.42627 / x₀) ^ (exp(x₀) * x₀)) + (exp(0.75035) + (-3.44...
5e-07 * (exp(x₀) ^ 3.1654)))
19 3.413e-03 4.859e-01 y = ((x₀ ^ 0.40686) + (exp((0.47803 / x₀) ^ x₀) + ((exp(x₀) ^ ...
1.8685) * -0.00019322))) + -0.39961
21 3.393e-03 2.935e-03 y = (-0.045569 + ((x₀ ^ 0.38808) + (exp((0.46993 / x₀) ^ x₀) +...
((exp(x₀) ^ 1.8773) * -0.00017942)))) + -0.32154
22 3.382e-03 3.239e-03 y = ((x₀ ^ 0.40686) + (exp((0.47803 / (x₀ * 0.95372)) ^ x₀) + ...
((exp(x₀) ^ 1.8685) * tanh(-0.00019322)))) + -0.39961
23 3.249e-03 4.024e-02 y = -0.33587 + ((x₀ ^ 0.37768) + (((exp(x₀) ^ 1.7986) * -0.000...
25789) + exp(0.95391 / (x₀ + (x₀ ^ (x₀ + x₀))))))
24 2.757e-03 1.643e-01 y = (-0.33118 + (((exp(0.41338 ^ x₀) + (exp(x₀) * -0.029169)) ...
+ (x₀ ^ tanh(-0.54561))) + (x₀ * (0.14758 * x₀)))) + 0.045004
26 2.748e-03 1.661e-03 y = (-0.33118 + (((x₀ ^ tanh(-0.54561)) + (exp(0.41338 ^ x₀) +...
(exp(x₀) * -0.029169))) + (x₀ * (0.14758 * x₀)))) + (0.045004...
* 1.1576)
28 2.735e-03 2.299e-03 y = ((((exp(x₀) * -0.029169) + (exp(0.41338 ^ x₀) + 0.20771)) ...
+ ((x₀ ^ -0.54561) + (-0.12587 / tanh(x₀)))) + (x₀ * (x₀ * 0.1...
4758))) + -0.33118
29 2.725e-03 3.784e-03 y = -0.40398 + (((exp((0.39274 + 0.36498) / (x₀ + (x₀ ^ -0.497...
47))) + (exp(x₀) * -0.029507)) + ((x₀ * 0.84467) ^ -0.75566)) ...
+ ((x₀ * 0.15314) * x₀))
30 2.622e-03 3.847e-02 y = -0.3954 + ((((exp(x₀) * -0.028641) + exp(0.38721 / (x₀ ^ x...
₀))) + ((tanh(x₀ ^ x₀) / x₀) + 0.36897)) + ((x₀ * (x₀ * 0.1544...
)) ^ 0.94872))
---------------------------------------------------------------------------------------------------
====================================================================================================
Press 'q' and then <enter> to stop execution early.
Checking if pysr_model_temp.pkl exists...
Loading model from pysr_model_temp.pkl
Re-optimizing parameterized candidate function 1/22...
>>> loop of re-parameterization with less NDF for bad fits 1/2...
Re-optimizing parameterized candidate function 2/22...
>>> loop of re-parameterization with less NDF for bad fits 1/2...
Re-optimizing parameterized candidate function 3/22...
>>> loop of re-parameterization with less NDF for bad fits 1/2...
Re-optimizing parameterized candidate function 4/22...
>>> loop of re-parameterization with less NDF for bad fits 2/4...
Re-optimizing parameterized candidate function 5/22...
>>> loop of re-parameterization with less NDF for bad fits 2/4...
Re-optimizing parameterized candidate function 6/22...
>>> loop of re-parameterization with less NDF for bad fits 2/4...
Re-optimizing parameterized candidate function 7/22...
>>> loop of re-parameterization with less NDF for bad fits 2/8...
Re-optimizing parameterized candidate function 8/22...
>>> loop of re-parameterization with less NDF for bad fits 4/8...
Re-optimizing parameterized candidate function 9/22...
>>> loop of re-parameterization with less NDF for bad fits 5/8...
Re-optimizing parameterized candidate function 10/22...
>>> loop of re-parameterization with less NDF for bad fits 5/8...
Re-optimizing parameterized candidate function 11/22...
>>> loop of re-parameterization with less NDF for bad fits 10/16...
Re-optimizing parameterized candidate function 12/22...
>>> loop of re-parameterization with less NDF for bad fits 6/16...
Re-optimizing parameterized candidate function 13/22...
>>> loop of re-parameterization with less NDF for bad fits 6/16...
Re-optimizing parameterized candidate function 14/22...
>>> loop of re-parameterization with less NDF for bad fits 6/16...
Re-optimizing parameterized candidate function 15/22...
>>> loop of re-parameterization with less NDF for bad fits 1/8...
Re-optimizing parameterized candidate function 16/22...
>>> loop of re-parameterization with less NDF for bad fits 1/8...
Re-optimizing parameterized candidate function 17/22...
>>> loop of re-parameterization with less NDF for bad fits 5/16...
Re-optimizing parameterized candidate function 18/22...
>>> loop of re-parameterization with less NDF for bad fits 4/16...
Re-optimizing parameterized candidate function 19/22...
>>> loop of re-parameterization with less NDF for bad fits 7/32...
Re-optimizing parameterized candidate function 20/22...
>>> loop of re-parameterization with less NDF for bad fits 2/32...
Re-optimizing parameterized candidate function 21/22...
>>> loop of re-parameterization with less NDF for bad fits 3/32...
Re-optimizing parameterized candidate function 22/22...
>>> loop of re-parameterization with less NDF for bad fits 2/32...
Save results to output files¶
Save results to csv tables:
- candidates.csv: saves all candidate functions and evaluations in a csv table.
- candidates_reduced.csv: saves a reduced version with the essential information, without intermediate results.
model.save_to_csv(output_dir = 'output_Toy_dataset_2a/')
Saving full results >>> output_Toy_dataset_2a/candidates.csv
Saving reduced results >>> output_Toy_dataset_2a/candidates_reduced.csv
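The saved table can then be inspected with pandas; note that the column names below ('complexity', 'rmse') are hypothetical placeholders for this sketch — check the actual header of candidates.csv:

```python
import pandas as pd

# Stand-in for pd.read_csv('output_Toy_dataset_2a/candidates.csv');
# the columns here are illustrative only.
df = pd.DataFrame({'complexity': [5, 12], 'rmse': [0.21, 0.07]})
# e.g. pick the candidate with the smallest rmse
best = df.sort_values('rmse').iloc[0]
```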
Plot results to pdf files:
- candidates.pdf: plots all candidate functions with associated uncertainties one by one for fit quality evaluation.
- candidates_sampling.pdf: plots all candidate functions with total uncertainty coverage generated by sampling parameters.
- candidates_gof.pdf: plots the goodness-of-fit scores.
- candidates_correlation.pdf: plots the correlation matrices for the parameters of the candidate functions.
model.plot_to_pdf(
    output_dir = 'output_Toy_dataset_2a/',
    bin_widths_1d = bin_widths_1d,
    #bin_edges_2d = bin_edges_2d,
    plot_logy = False,
    plot_logx = False,
    sampling_95quantile = False
)
Plotting candidate functions 22/22 >>> output_Toy_dataset_2a/candidates.pdf
Plotting candidate functions (sampling parameters) 22/22 >>> output_Toy_dataset_2a/candidates_sampling.pdf
Plotting correlation matrices 22/22 >>> output_Toy_dataset_2a/candidates_correlation.pdf
Plotting goodness-of-fit scores >>> output_Toy_dataset_2a/candidates_gof.pdf