from symbolfit.symbolfit import *
Detected IPython. Loading juliacall extension. See https://juliapy.github.io/PythonCall.jl/stable/compat/#IPython
Dataset¶
Five inputs are needed, each a Python list or NumPy array (more options will be added in the future!):
- x: independent variable (bin center locations).
- y: dependent variable.
- y_up: upward uncertainty in y per bin.
- y_down: downward uncertainty in y per bin.
- bin_widths_1d: bin widths in x.
- Elements of both y_up and y_down should be non-negative.
- These values are the "delta" in y:
- y + y_up = y shifted up by one standard deviation.
- y - y_down = y shifted down by one standard deviation.
- If the dataset has no uncertainties, set both y_up and y_down to ones with the same shape as x.
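For instance, the no-uncertainty case can be set up as follows (a minimal sketch using NumPy; the toy values are arbitrary):

```python
import numpy as np

# When the dataset carries no uncertainties, unit "uncertainties" give
# every bin equal weight in the loss.
x = [0.5, 1, 1.5, 2]
y = [3.0, 2.2, 2.1, 2.05]
y_up = np.ones(len(x))    # same shape as x, all ones
y_down = np.ones(len(x))
```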
x = [0.5, 1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5]
y = [3.0, 2.2, 2.1, 2.05, 2, 2.1, 2.2, 1.9, 1.6]
y_up = [0.4, 0.3, 0.2, 0.2, 0.1, 0.05, 0.06, 0.1, 0.1]
y_down = [0.4, 0.3, 0.2, 0.2, 0.1, 0.05, 0.06, 0.1, 0.1]
bin_widths_1d = [0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5]
Plot the dataset to see what we will be fitting to:
fig, axes = plt.subplots(figsize = (6, 4))
plt.errorbar(np.array(x).flatten(),
             np.array(y).flatten(),
             yerr = [np.array(y_down).flatten(), np.array(y_up).flatten()],
             xerr = np.array(bin_widths_1d)/2,
             fmt = '.', c = 'black', ecolor = 'grey', capsize = 0,
             )
plt.savefig('img/toy2a/dataset.png')
Configure the fit¶
Configure PySR to define the function space searched by symbolic regression:
from pysr import PySRRegressor
pysr_config = PySRRegressor(
    model_selection = 'accuracy',
    niterations = 200,
    maxsize = 80,
    binary_operators = [
        '+', '*', '/', '^'
    ],
    unary_operators = [
        'exp',
        'tanh',
    ],
    nested_constraints = {
        'exp': {'exp': 0, 'tanh': 0, '*': 2, '/': 1, '^': 1},
        'tanh': {'exp': 0, 'tanh': 0, '*': 2, '/': 1, '^': 1},
        '*': {'exp': 1, 'tanh': 1, '*': 2, '/': 1, '^': 1},
        '^': {'exp': 1, 'tanh': 1, '*': 2, '/': 1, '^': 0},
        '/': {'exp': 1, 'tanh': 1, '*': 2, '/': 0, '^': 1},
    },
    loss = 'loss(y, y_pred, weights) = (y - y_pred)^2 * weights',
)
Here, we allow four binary operators (+, *, /, pow) and two unary operators (exp, tanh) when searching for functional forms. The custom-defined gauss from the previous example is likely not needed here, since this dataset does not show an obvious peak.
Nested constraints are imposed to prohibit overly nested forms such as exp(exp(x)).
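To make the semantics concrete, here is a small illustrative check, not part of PySR, of how such a constraint rules out expressions; for this sketch, expressions are represented as nested tuples like ('exp', ('*', 'x', 'x')):

```python
def count_op(expr, op):
    """Count occurrences of `op` in an expression tree."""
    if not isinstance(expr, tuple):
        return 0
    head, *args = expr
    return (head == op) + sum(count_op(a, op) for a in args)

def violates(expr, constraints):
    """True if any operator's subtree exceeds its allowed nesting counts."""
    if not isinstance(expr, tuple):
        return False
    head, *args = expr
    limits = constraints.get(head, {})
    for op, max_count in limits.items():
        if sum(count_op(a, op) for a in args) > max_count:
            return True
    return any(violates(a, constraints) for a in args)

constraints = {'exp': {'exp': 0}}  # no exp allowed inside exp
violates(('exp', ('exp', 'x')), constraints)    # True: exp(exp(x)) is banned
violates(('exp', ('*', 'x', 'x')), constraints) # False: exp(x*x) is fine
```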
The loss function is a weighted MSE, where by default SymbolFit sets the weight from the inverse squared uncertainty.
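The loss string above is Julia code executed by the PySR backend; a NumPy version is sketched below just for intuition:

```python
import numpy as np

# Weighted MSE: each squared residual is multiplied by a per-bin weight
# before averaging, so bins with larger weights pull the fit harder.
def weighted_mse(y, y_pred, weights):
    y, y_pred, weights = map(np.asarray, (y, y_pred, weights))
    return np.mean((y - y_pred) ** 2 * weights)

weighted_mse([1.0, 2.0], [1.0, 3.0], [1.0, 0.25])  # -> 0.125
```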
For the full list of PySR options, please see the PySR documentation.
Configure SymbolFit with the PySR config and for the re-optimization process:
model = SymbolFit(
    # Dataset: x, y, y_up, y_down.
    x = x,
    y = y,
    y_up = y_up,
    y_down = y_down,
    # PySR configuration of the function space.
    pysr_config = pysr_config,
    # Constrain the maximum function size, overwriting maxsize in pysr_config.
    # Set a higher value for more complex shapes, or when a lower one does not fit well.
    max_complexity = 30,
    # Whether to scale the input x to within 0 and 1 during the fits for numerical stability,
    # as large x could lead to overflow through e.g. exp(x) -> exp(10000).
    # Set this to False when your x's are at or close to O(1); otherwise True is recommended.
    # After the fits, the functions are unscaled to reflect the original dataset.
    input_rescale = False,
    # ^ no scaling needed here since the input x is O(1).
    # Whether to scale y during the fits for numerical stability;
    # options (when input_rescale is True): None / 'mean' / 'max' / 'l2'.
    # This is useful to stabilize fits when your y's are very large or very small.
    # After the fits, the functions are unscaled to reflect the original dataset.
    scale_y_by = None,
    # ^ no scaling needed here since the input y is O(1).
    # Set a maximum standard error (%) for all parameters to avoid bad fits during re-optimization.
    # In the refit loop, when any parameter returns a standard error larger than max_stderr,
    # the fit is considered failed and is retried with fewer or different combinations of varying
    # parameters, freezing some parameters to their initial values during re-optimization.
    # This avoids bad fits when the objective is too complex to minimize, which could cause some
    # parameters to have unrealistically large standard errors.
    # In most cases 10 < max_stderr < 100 suffices.
    max_stderr = 20,
    # Use y_up and y_down to weight the MSE loss during the SR search and re-optimization.
    fit_y_unc = True,
    # Set a random seed to return the same batch of functional forms every time (single-threaded);
    # otherwise set None to explore more functions each run (multi-threaded and faster).
    # The function space is usually huge: retrying the fit with the exact same configuration
    # but a different random seed can yield a completely different set of candidate functions.
    # So if the candidates are not satisfactory this time, rerun a few more times with
    # random_seed = None or a different seed each time.
    random_seed = None,
    # Custom loss weights to set "(y - y_pred)^2 * loss_weights", overriding the weights from y_up and y_down.
    loss_weights = None
)
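For intuition, input_rescale corresponds to a min-max style scaling (the exact internal implementation may differ); a minimal sketch:

```python
import numpy as np

# Min-max scaling maps x into [0, 1], which keeps terms like exp(x)
# numerically tame during the search; after the fits, the functions
# are expressed back in terms of the original x.
x = np.array([0.5, 1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5])
x_scaled = (x - x.min()) / (x.max() - x.min())
```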
Symbol Fit it!¶
Run the fits: SR fit to search for functional forms -> parameterization -> re-optimization fit for improved best-fits and uncertainty estimation -> evaluation.
model.fit()
Compiling Julia backend...
[ Info: Started!
Expressions evaluated per second: 7.180e+05
Head worker occupation: 16.4%
Progress: 1457 / 3000 total iterations (48.567%)
====================================================================================================
Hall of Fame:
---------------------------------------------------------------------------------------------------
Complexity Loss Score Equation
1 1.752e-01 1.594e+01 y = 1.68
2 3.437e-02 1.629e+00 y = exp(0.72041)
5 2.267e-02 1.387e-01 y = (-0.15192 * x₀) + 2.5457
6 1.478e-02 4.280e-01 y = 2.2524 + (exp(x₀) * -0.0061958)
7 1.043e-02 3.484e-01 y = (-0.00062681 * (x₀ ^ x₀)) + 2.1433
8 1.021e-02 2.138e-02 y = (-5.2253e-07 * (exp(x₀) ^ 3.0764)) + 2.1265
10 1.016e-02 2.503e-03 y = tanh(-2.9879e-05 * ((x₀ + x₀) ^ x₀)) + 2.1267
11 7.875e-03 2.548e-01 y = (2.1336 + (-0.00061524 * (x₀ ^ x₀))) + (0.13061 ^ x₀)
12 7.452e-03 5.516e-02 y = (2.1165 + (-5.0177e-07 * (exp(x₀) ^ 3.0805))) + (0.15228 ^...
x₀)
13 6.008e-03 2.153e-01 y = (2.1336 + (-0.00061524 * (x₀ ^ x₀))) + ((0.013848 ^ x₀) / ...
0.13724)
14 5.845e-03 2.759e-02 y = ((-5.0177e-07 * (exp(x₀) ^ 3.0805)) + 2.1165) + ((0.14706 ...
^ x₀) / x₀)
15 5.702e-03 2.468e-02 y = ((0.47413 / x₀) ^ 3.1654) + ((-3.4388e-07 * (exp(x₀) ^ 3.1...
654)) + exp(0.74782))
16 5.567e-03 2.404e-02 y = ((0.45735 / x₀) ^ exp(x₀)) + ((-3.445e-07 * (exp(x₀) ^ 3.1...
654)) + exp(0.75035))
17 5.551e-03 2.948e-03 y = ((0.45735 / x₀) ^ (3.172 * x₀)) + (((exp(x₀) ^ 3.172) * -3...
.3335e-07) + exp(0.75035))
18 5.549e-03 3.421e-04 y = ((0.42627 / x₀) ^ (exp(x₀) * x₀)) + (exp(0.75035) + (-3.44...
5e-07 * (exp(x₀) ^ 3.1654)))
19 3.413e-03 4.859e-01 y = ((x₀ ^ 0.40686) + (exp((0.47803 / x₀) ^ x₀) + ((exp(x₀) ^ ...
1.8685) * -0.00019322))) + -0.39961
21 3.393e-03 2.935e-03 y = (-0.045569 + ((x₀ ^ 0.38808) + (exp((0.46993 / x₀) ^ x₀) +...
((exp(x₀) ^ 1.8773) * -0.00017942)))) + -0.32154
22 3.382e-03 3.239e-03 y = ((x₀ ^ 0.40686) + (exp((0.47803 / (x₀ * 0.95372)) ^ x₀) + ...
((exp(x₀) ^ 1.8685) * tanh(-0.00019322)))) + -0.39961
23 3.249e-03 4.024e-02 y = -0.33587 + ((x₀ ^ 0.37768) + (((exp(x₀) ^ 1.7986) * -0.000...
25789) + exp(0.95391 / (x₀ + (x₀ ^ (x₀ + x₀))))))
24 2.757e-03 1.643e-01 y = (-0.33118 + (((exp(0.41338 ^ x₀) + (exp(x₀) * -0.029169)) ...
+ (x₀ ^ tanh(-0.54561))) + (x₀ * (0.14758 * x₀)))) + 0.045004
26 2.748e-03 1.661e-03 y = (-0.33118 + (((x₀ ^ tanh(-0.54561)) + (exp(0.41338 ^ x₀) +...
(exp(x₀) * -0.029169))) + (x₀ * (0.14758 * x₀)))) + (0.045004...
* 1.1576)
28 2.735e-03 2.299e-03 y = ((((exp(x₀) * -0.029169) + (exp(0.41338 ^ x₀) + 0.20771)) ...
+ ((x₀ ^ -0.54561) + (-0.12587 / tanh(x₀)))) + (x₀ * (x₀ * 0.1...
4758))) + -0.33118
29 2.725e-03 3.784e-03 y = -0.40398 + (((exp((0.39274 + 0.36498) / (x₀ + (x₀ ^ -0.497...
47))) + (exp(x₀) * -0.029507)) + ((x₀ * 0.84467) ^ -0.75566)) ...
+ ((x₀ * 0.15314) * x₀))
30 2.622e-03 3.847e-02 y = -0.3954 + ((((exp(x₀) * -0.028641) + exp(0.38721 / (x₀ ^ x...
₀))) + ((tanh(x₀ ^ x₀) / x₀) + 0.36897)) + ((x₀ * (x₀ * 0.1544...
)) ^ 0.94872))
---------------------------------------------------------------------------------------------------
====================================================================================================
Press 'q' and then <enter> to stop execution early.
Checking if pysr_model_temp.pkl exists...
Loading model from pysr_model_temp.pkl
Re-optimizing parameterized candidate function 1/22...
>>> loop of re-parameterization with less NDF for bad fits 1/2...
Re-optimizing parameterized candidate function 2/22...
>>> loop of re-parameterization with less NDF for bad fits 1/2...
Re-optimizing parameterized candidate function 3/22...
>>> loop of re-parameterization with less NDF for bad fits 1/2...
Re-optimizing parameterized candidate function 4/22...
>>> loop of re-parameterization with less NDF for bad fits 2/4...
Re-optimizing parameterized candidate function 5/22...
>>> loop of re-parameterization with less NDF for bad fits 2/4...
Re-optimizing parameterized candidate function 6/22...
>>> loop of re-parameterization with less NDF for bad fits 2/4...
Re-optimizing parameterized candidate function 7/22...
>>> loop of re-parameterization with less NDF for bad fits 2/8...
Re-optimizing parameterized candidate function 8/22...
>>> loop of re-parameterization with less NDF for bad fits 4/8...
Re-optimizing parameterized candidate function 9/22...
>>> loop of re-parameterization with less NDF for bad fits 5/8...
Re-optimizing parameterized candidate function 10/22...
>>> loop of re-parameterization with less NDF for bad fits 5/8...
Re-optimizing parameterized candidate function 11/22...
>>> loop of re-parameterization with less NDF for bad fits 10/16...
Re-optimizing parameterized candidate function 12/22...
>>> loop of re-parameterization with less NDF for bad fits 6/16...
Re-optimizing parameterized candidate function 13/22...
>>> loop of re-parameterization with less NDF for bad fits 6/16...
Re-optimizing parameterized candidate function 14/22...
>>> loop of re-parameterization with less NDF for bad fits 6/16...
Re-optimizing parameterized candidate function 15/22...
>>> loop of re-parameterization with less NDF for bad fits 1/8...
Re-optimizing parameterized candidate function 16/22...
>>> loop of re-parameterization with less NDF for bad fits 1/8...
Re-optimizing parameterized candidate function 17/22...
>>> loop of re-parameterization with less NDF for bad fits 5/16...
Re-optimizing parameterized candidate function 18/22...
>>> loop of re-parameterization with less NDF for bad fits 4/16...
Re-optimizing parameterized candidate function 19/22...
>>> loop of re-parameterization with less NDF for bad fits 7/32...
Re-optimizing parameterized candidate function 20/22...
>>> loop of re-parameterization with less NDF for bad fits 2/32...
Re-optimizing parameterized candidate function 21/22...
>>> loop of re-parameterization with less NDF for bad fits 3/32...
Re-optimizing parameterized candidate function 22/22...
>>> loop of re-parameterization with less NDF for bad fits 2/32...
Save results to output files¶
Save results to csv tables:
- candidates.csv: saves all candidate functions and evaluations in a csv table.
- candidates_reduced.csv: saves a reduced version with the essential information, without intermediate results.
model.save_to_csv(output_dir = 'output_Toy_dataset_2a/')
Saving full results >>> output_Toy_dataset_2a/candidates.csv
Saving reduced results >>> output_Toy_dataset_2a/candidates_reduced.csv
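The saved table can then be inspected with pandas; note that the column names below ('complexity', 'rmse') are hypothetical placeholders for this sketch — check the actual header of candidates.csv:

```python
import pandas as pd

# Stand-in for pd.read_csv('output_Toy_dataset_2a/candidates.csv');
# the columns here are illustrative only.
df = pd.DataFrame({'complexity': [5, 12], 'rmse': [0.21, 0.07]})
# e.g. pick the candidate with the smallest rmse
best = df.sort_values('rmse').iloc[0]
```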
Plot results to pdf files:
- candidates.pdf: plots all candidate functions with associated uncertainties one by one for fit quality evaluation.
- candidates_sampling.pdf: plots all candidate functions with total uncertainty coverage generated by sampling parameters.
- candidates_gof.pdf: plots the goodness-of-fit scores.
- candidates_correlation.pdf: plots the correlation matrices for the parameters of the candidate functions.
model.plot_to_pdf(
    output_dir = 'output_Toy_dataset_2a/',
    bin_widths_1d = bin_widths_1d,
    #bin_edges_2d = bin_edges_2d,
    plot_logy = False,
    plot_logx = False,
    sampling_95quantile = False
)
Plotting candidate functions 22/22 >>> output_Toy_dataset_2a/candidates.pdf
Plotting candidate functions (sampling parameters) 22/22 >>> output_Toy_dataset_2a/candidates_sampling.pdf
Plotting correlation matrices 22/22 >>> output_Toy_dataset_2a/candidates_correlation.pdf
Plotting goodness-of-fit scores >>> output_Toy_dataset_2a/candidates_gof.pdf