SymbolFit Class
Arguments
Dataset
- `x`: list | ndarray. Default: `None`
Independent variable x, or bin center values for histogram data. If provided as a Python list, use, e.g.,
`[1, 2, 3, ...]` for 1D, `[[1, 1], [1, 2], [1, 3], ...]` for 2D, `[[1, 1, 1], [1, 1, 2], [1, 1, 3], ...]` for 3D, and so on. If provided as an ndarray, the shape is (num_examples, dim).
- `y`: list | ndarray. Default: `None`
Dependent variable y, or bin content values for histogram data. Shape is (num_examples, 1).
- `y_up`: list | ndarray. Default: `1`
Upper one standard deviation of y (+1 sigma). It should be the absolute deviation value (not relative) and non-negative. Shape is (num_examples, 1).
If your data has no uncertainty, set both `y_up` and `y_down` to `1` (the default) and set `fit_y_unc = False` so that all data points are weighted equally in the fit.
- `y_down`: list | ndarray. Default: `1`
Lower one standard deviation of y (-1 sigma). It should be the absolute deviation value (not relative) and non-negative. Shape is (num_examples, 1).
Asymmetric uncertainties are supported: the fit automatically uses `y_up` when the residual is positive and `y_down` when it is negative.
See input data format for a graphical illustration of how to prepare x, y, y_up, y_down, and bin widths/edges.
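As a minimal sketch of the input shapes described above, the following prepares 1D histogram data with NumPy. The bin edges and counts are made-up values, and the use of sqrt(N) for the uncertainties is an assumption (Poisson counting errors), not something SymbolFit requires.

```python
import numpy as np

# Hypothetical 1D histogram: 5 bins defined by 6 edges (values are made up).
bin_edges = np.array([0.0, 1.0, 2.0, 4.0, 6.0, 10.0])
counts = np.array([120.0, 85.0, 40.0, 12.0, 3.0])

# Bin centers become x and bin contents become y, both as column vectors.
x = (0.5 * (bin_edges[:-1] + bin_edges[1:])).reshape(-1, 1)  # (num_examples, dim=1)
y = counts.reshape(-1, 1)                                    # (num_examples, 1)

# Absolute, non-negative 1-sigma uncertainties.
# Assumption: Poisson errors sqrt(N); asymmetric values could be used instead.
y_up = np.sqrt(counts).reshape(-1, 1)
y_down = np.sqrt(counts).reshape(-1, 1)

# Bin widths, used only for plotting the data as histogram bars.
bin_widths_1d = np.diff(bin_edges).reshape(-1, 1)
```

All five arrays share the same first dimension, so row i of each array describes the same bin.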
Fit configuration
- `pysr_config`: pysr.PySRRegressor class. Default: built-in config with `+`, `*`, `/`, `^`
Configuration for the PySR symbolic regression search. This controls which mathematical operators are available, how many iterations to run, population size, and other search hyperparameters. See PySR documentation for all available options.
The configuration can be stored in a Python file such as pysr_config.py:

```python
from pysr import PySRRegressor

pysr_config = PySRRegressor(
    model_selection = 'accuracy',
    niterations = 100,
    maxsize = 50,
    binary_operators = ['+', '*'],
    unary_operators = ['exp', 'tanh'],
    ...
)
```

and then loaded from there:

```python
import importlib

pysr_config = importlib.import_module('directory.pysr_config').pysr_config
model = SymbolFit(..., pysr_config=pysr_config, ...)
```

Tip
The choice of operators is the most impactful setting. Start with operators that match the expected behavior of your data (e.g., `exp` for exponential decays, `gauss` for peaked distributions). See PySR config examples for common configurations.
- `max_complexity`: int. Default: `40`
Maximum complexity of the expression tree. Each operator, variable, and constant counts toward this budget (e.g., `a1 * exp(a2 * x0)` has complexity ~6). Higher values allow more complex functions but increase search time and the risk of overfitting.
If provided, this overrides the `maxsize` parameter in `PySRRegressor()`.
Tip
Start with 40-60 for most cases. Increase if the search consistently returns functions that are too simple to capture the data shape. Decrease if functions are overfitting or the search is too slow.
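To get a feel for how quickly the complexity budget is used up, the sketch below counts the nodes of an expression tree using Python's `ast` module. The helper `count_nodes` is hypothetical and only approximates the idea of one unit per node; PySR's own complexity measure can be configured differently.

```python
import ast

def count_nodes(expression: str) -> int:
    """Count operators, function calls, variables, and constants in an expression.

    Illustrative only: this is plain node counting, not PySR's implementation.
    """
    nodes = list(ast.walk(ast.parse(expression, mode="eval")))
    # The function name inside a call (e.g. `exp`) is part of the call node,
    # so it is excluded to avoid counting one function application twice.
    function_names = {id(n.func) for n in nodes if isinstance(n, ast.Call)}
    counted = (ast.BinOp, ast.UnaryOp, ast.Call, ast.Name, ast.Constant)
    return sum(
        1 for n in nodes
        if isinstance(n, counted) and id(n) not in function_names
    )

complexity = count_nodes("a1 * exp(a2 * x0)")
```

For `a1 * exp(a2 * x0)` the counted nodes are two multiplications, one `exp` call, two parameters, and one variable.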
- `input_rescale`: bool. Default: `True`
Rescale x to the range (0, 1) before fitting. This prevents numerical instability or overflow when x values are very large or span many orders of magnitude. All fitted functions are automatically unscaled in the output, so the final expressions are in terms of the original x.
Tip
Keep this enabled (True) unless you have a specific reason not to. Disabling it can cause fits to fail when x values are large.
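The idea of rescaling and then unscaling can be sketched as a simple min-max transform. This is an assumption about the form of the transform, not SymbolFit's documented implementation, and the helpers `rescale` and `unscale` are hypothetical.

```python
def rescale(x_values):
    """Map values linearly onto [0, 1]; return the scaled values and the transform."""
    lo = min(x_values)
    span = max(x_values) - lo
    scaled = [(v - lo) / span for v in x_values]
    return scaled, lo, span

def unscale(scaled_value, lo, span):
    """Invert the transform to recover a value in the original units."""
    return scaled_value * span + lo

# Large raw values (made up) that could cause overflow in, e.g., exp(x).
x_raw = [1000.0, 2500.0, 4000.0, 7000.0]
x_scaled, lo, span = rescale(x_raw)
```

Under this assumption, a function f found in the scaled variable corresponds to f((x - lo) / span) in the original variable, which is the kind of substitution needed to express the output in terms of the original x.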
- `scale_y_by`: str | None. Default: `'mean'`
Normalize y before fitting. Options: `'mean'`, `'max'`, `'l2'`, or `None` (no normalization). As with `input_rescale`, the final output functions are unscaled back to the original units. Only applies when `input_rescale` is True.
Tip
Use `'mean'` for most cases. Set to `None` if the other options do not work well.
- `max_stderr`: float (%). Default: `20`
Maximum allowed relative uncertainty (in %) for any single parameter during the LMFIT re-optimization stage. If any parameter exceeds this threshold, the fit is considered unreliable and is retried with fewer free parameters (some are held fixed at their initial values from PySR).
This acts as a quality gate by preventing the final results from containing parameters with meaninglessly large uncertainties.
Tip
Values of 10-40 work well in practice. Lower values are stricter (more parameters may be frozen), higher values are more permissive. If many of your candidates show frozen parameters, try increasing this.
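The quality gate described above can be pictured as a check on each parameter's relative uncertainty. The following is a minimal sketch with made-up parameter values; the helper `exceeds_max_stderr` is hypothetical, and how SymbolFit chooses which parameters to freeze on a retry is not shown here.

```python
def exceeds_max_stderr(values, stderrs, max_stderr=20.0):
    """Return the names of parameters whose relative uncertainty (in %) is too large."""
    flagged = []
    for name, value in values.items():
        # Relative uncertainty in percent; a zero-valued parameter is always flagged.
        relative = float("inf") if value == 0 else abs(stderrs[name] / value) * 100.0
        if relative > max_stderr:
            flagged.append(name)
    return flagged

# Hypothetical fit result: best-fit values and their standard errors.
values = {"a1": 2.0, "a2": -0.5, "a3": 10.0}
stderrs = {"a1": 0.1, "a2": 0.3, "a3": 1.0}

flagged = exceeds_max_stderr(values, stderrs, max_stderr=20.0)
```

Here `a2` has a relative uncertainty of 60%, so a fit with `max_stderr = 20` would be treated as unreliable, while `a1` (5%) and `a3` (10%) pass.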
- `fit_y_unc`: bool. Default: `True`
Whether to use `y_up`/`y_down` as weights in the fit loss function. When True, the loss is chi2-weighted: `(y_pred - y_true)^2 / y_unc^2`, where `y_unc` is taken as `y_up` when the residual is positive and `y_down` when it is negative.
Set to False for an unweighted (least-squares) fit in which all data points contribute equally, regardless of their uncertainties. This is useful when uncertainties are not available or not meaningful.
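The chi2-weighted loss with asymmetric uncertainties can be sketched in a few lines. Two assumptions are made here: the residual is defined as `y_pred - y_true`, following the formula above, and the per-point terms are summed. The function name `asymmetric_chi2` is hypothetical.

```python
def asymmetric_chi2(y_pred, y_true, y_up, y_down):
    """Sum of squared residuals, each divided by the matching one-sided variance."""
    total = 0.0
    for pred, true, up, down in zip(y_pred, y_true, y_up, y_down):
        residual = pred - true  # assumed sign convention
        # Use the upper uncertainty for positive residuals, the lower one otherwise.
        y_unc = up if residual > 0 else down
        total += residual**2 / y_unc**2
    return total

# Made-up example: residuals are +2, -2, and 0.
chi2 = asymmetric_chi2(
    y_pred=[12.0, 18.0, 30.0],
    y_true=[10.0, 20.0, 30.0],
    y_up=[2.0, 2.0, 2.0],
    y_down=[1.0, 1.0, 1.0],
)
```

The two residuals have the same magnitude but contribute 1 and 4 to the total, because the downward uncertainty is half the upward one.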
- `random_seed`: int | None. Default: `None`
Set to an integer to make the symbolic regression search reproducible. When set, PySR is forced to run in single-threaded mode, which makes runs slower but guarantees identical results across runs.
Leave as `None` for the fastest search (multi-threaded, non-deterministic). Because the function space is vast, rerunning with `random_seed = None` naturally produces different candidates each time, which can be useful for exploring the solution space.
- `loss_weights`: list | ndarray | None. Default: `None`
Custom per-bin weights for the fit loss. When provided, the loss becomes `(y_pred - y_true)^2 * loss_weights`, which overrides the `y_up`/`y_down` uncertainty weighting. Shape is (num_examples, 1).
This is useful when you want to emphasize certain regions of the data (e.g., assign higher weights to a signal region) or de-emphasize others.
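As an illustration of emphasizing a region, the sketch below builds per-bin weights that are five times larger inside a hypothetical signal window. The window limits, the weight values, and the helper `weighted_loss` are all assumptions chosen for the example, and the nested-list layout is one way to obtain the (num_examples, 1) shape.

```python
# Bin centers (made up) and a hypothetical signal region in x.
x = [0.5, 1.5, 2.5, 3.5, 4.5]
signal_lo, signal_hi = 2.0, 4.0

# One weight per bin, stored as a column: shape (num_examples, 1).
loss_weights = [[5.0] if signal_lo <= xi <= signal_hi else [1.0] for xi in x]

def weighted_loss(y_pred, y_true, weights):
    """Sum of (y_pred - y_true)^2 * weight over all bins (summation is assumed)."""
    return sum((p - t) ** 2 * w[0] for p, t, w in zip(y_pred, y_true, weights))

# A unit residual in every bin: bins inside the window dominate the total.
loss = weighted_loss([1.0] * 5, [0.0] * 5, loss_weights)
```

With these weights, a given residual inside the window costs as much as five equal residuals outside it, which pushes the search toward functions that describe that region well.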
Methods
fit()
Performs the full SymbolFit pipeline:
- Runs PySR to search for candidate functional forms.
- Parameterizes all numerical constants (replaces them with named parameters `a1`, `a2`, ...).
- Re-optimizes each candidate with LMFIT to refine parameter values and provide uncertainty estimates (re-optimization fit, or ROF).
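The parameterization step can be pictured with a small string-based sketch. This is not SymbolFit's implementation, which may work on the expression tree and treat signs or integer constants differently; the regular expression and the helper `parameterize` are hypothetical.

```python
import re

# Unsigned numeric literals that are not part of an identifier such as x0.
NUMBER = re.compile(r"(?<![\w.])\d+(?:\.\d+)?(?:[eE][+-]?\d+)?(?![\w.])")

def parameterize(expression):
    """Replace each numeric constant with a1, a2, ... and record its initial value."""
    initial_values = {}

    def substitute(match):
        name = f"a{len(initial_values) + 1}"
        initial_values[name] = float(match.group(0))
        return name

    return NUMBER.sub(substitute, expression), initial_values

template, initial_values = parameterize("2.5 * exp(-0.3 * x0) + 1.2")
```

The recorded values play the role of the starting point for the re-optimization fit, and the template is the functional form whose parameters are then refined. In this sketch the minus sign stays in the template rather than in the parameter value.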
save_to_csv()
Saves all candidate functions and their evaluation metrics to CSV files.
- `candidates.csv`: full results including intermediate fit details, parameterization, covariance matrices, and goodness-of-fit metrics.
- `candidates_compact.csv`: compact version with only the final functions, parameters, and key metrics for quick inspection.

- `output_dir`: str. Default: `'./'`
Output directory. Created automatically if it does not exist.
plot_to_pdf()
Generates diagnostic plots for all candidate functions.
- `candidates.pdf`: each candidate plotted against the data with parameter-by-parameter uncertainty variations, plus residual and ratio panels.
- `candidates_sampling.pdf`: total uncertainty coverage bands generated by Monte Carlo sampling of parameters using their covariance matrix (1D only).
- `candidates_gof.pdf`: summary of goodness-of-fit metrics (Chi2/NDF, RMSE, R2, p-value) across all candidates for comparison.
- `candidates_correlation.pdf`: parameter correlation matrices for each candidate.

- `output_dir`: str. Default: `'./'`
Output directory. Created automatically if it does not exist.
Options for 1D data
- `bin_widths_1d`: list | ndarray. Default: `None`
Bin widths corresponding to each x value. When provided, data points are plotted as histogram bars instead of scatter points. Shape is (num_examples, 1). See input data format for a graphical illustration.
- `plot_logx`: bool. Default: `False`
Use logarithmic scale for the x-axis in candidates.pdf.
- `plot_logy`: bool. Default: `False`
Use logarithmic scale for the y-axis in candidates.pdf.
- `sampling_95quantile`: bool. Default: `False`
Whether to include the 95% quantile range (in addition to the default 68% range) when plotting total uncertainty coverage in candidates_sampling.pdf. Enable this to visualize wider uncertainty bands.
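The coverage bands in candidates_sampling.pdf come from Monte Carlo sampling of the parameters using their covariance matrix. A minimal NumPy sketch of that idea follows; the model, best-fit values, covariance matrix, and sample count are made up, and the Gaussian sampling and percentile choices are assumptions rather than SymbolFit's exact procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical candidate a1 * exp(-a2 * x) with made-up fit results.
best_fit = np.array([2.5, 0.3])
covariance = np.array([[0.04, -0.001],
                       [-0.001, 0.0004]])

def model(x, a1, a2):
    return a1 * np.exp(-a2 * x)

x = np.linspace(0.0, 10.0, 50)

# Draw correlated parameter sets and evaluate the function for each draw.
samples = rng.multivariate_normal(best_fit, covariance, size=2000)
curves = np.array([model(x, a1, a2) for a1, a2 in samples])  # (2000, 50)

# Central 68% and 95% ranges of the sampled curves at each x.
lo68, hi68 = np.percentile(curves, [16, 84], axis=0)
lo95, hi95 = np.percentile(curves, [2.5, 97.5], axis=0)
```

The 95% band always contains the 68% band, which is why enabling `sampling_95quantile` shows a wider envelope around the same central curve.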
Options for 2D data
- `bin_edges_2d`: list. Default: `None`
Bin edges for plotting 2D histogram data, provided as a list of two sub-lists:
`[[x0_0, x0_1, ...], [x1_0, x1_1, ...]]`. The leftmost bin in x0 has edges `x0_0` and `x0_1`. `[x0_0, x0_1, ...]` has `(num_x0_bins + 1)` elements and `[x1_0, x1_1, ...]` has `(num_x1_bins + 1)` elements. This must be a Python list (not an ndarray) since the two sub-lists can have different lengths.
- `plot_logx0`: bool. Default: `False`
Use logarithmic scale for the x0-axis in 2D plots.
- `plot_logx1`: bool. Default: `False`
Use logarithmic scale for the x1-axis in 2D plots.
- `plot_logy`: bool. Default: `False`
Use logarithmic scale for the y-axis (color scale) in 2D plots.
- `cbar_min`: float. Default: `None`
Minimum value for the color bar range in 2D plots. If None, determined automatically from the data.
- `cbar_max`: float. Default: `None`
Maximum value for the color bar range in 2D plots. If None, determined automatically from the data.
- `cmap`: str. Default: `None`
Matplotlib colormap name for 2D plots (e.g., `'viridis'`, `'coolwarm'`, `'RdBu_r'`). If None, uses the matplotlib default.
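For 2D histograms, the relationship between `bin_edges_2d` and the bin-center input x can be sketched as follows. The edge values are made up, the helper `centers` is hypothetical, and the row ordering of x (x0 outer, x1 inner) is an assumption; the input data format page is the reference for the expected ordering.

```python
# Hypothetical edges: 3 bins in x0 and 2 bins in x1 (different lengths are fine).
x0_edges = [0.0, 1.0, 2.0, 3.0]
x1_edges = [10.0, 20.0, 40.0]

# A plain Python list of two sub-lists, as required for bin_edges_2d.
bin_edges_2d = [x0_edges, x1_edges]

def centers(edges):
    """Midpoint of each pair of adjacent edges."""
    return [0.5 * (lo + hi) for lo, hi in zip(edges[:-1], edges[1:])]

# One [x0, x1] row per bin, giving shape (num_examples, 2).
# Assumption: x0 is the outer loop and x1 the inner loop.
x = [[c0, c1] for c0 in centers(x0_edges) for c1 in centers(x1_edges)]
```

Each sub-list has one more element than the number of bins along that axis, so 3 x 2 bins give 6 rows in x.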
print_candidate()
Print candidate functions with fully substituted parameter values to the terminal.
- `candidate_number`: int. Default: `99`
Print a specific candidate by its number (as shown in the output CSV/PDF), or set to `99` to print all candidates.