Why SymbolFit?¶
The problem: manual function guessing¶
Whenever you need to find a smooth, closed-form function to model a dataset, whether it consists of scattered data points or a binned histogram, the traditional approach is tedious and manual:
- Guess a functional form (polynomial, exponential, Gaussian, etc.).
- Fit it to the data.
- Evaluate the fit quality.
- If it's not good enough, go back to step 1 and try a different form.
This trial-and-error process can take dozens to hundreds of iterations, and the function that works for one dataset rarely generalizes to another.
For simple cases (e.g., a linear trend or a clean exponential decay), this works fine. But what if the data has a complex shape, e.g., a peak followed by a long tail, multiple overlapping features, or a distribution that doesn't match any textbook function? Manually constructing and testing candidates becomes extremely time-consuming, especially when you need reliable uncertainty estimates on top of the fit.
This inefficiency shows up across many fields. In experimental high-energy physics (HEP), for example, some new physics searches at the CERN Large Hadron Collider (LHC) require empirical background modeling with custom functions hand-crafted for each analysis. Recent examples include searches in dijet, trijet, paired-dijet, diphoton, and dimuon channels, as well as the analyses that led to the Higgs boson discovery by ATLAS and CMS.
SymbolFit automates this entire process.
How it works: symbolic regression¶
Instead of requiring you to specify a functional form upfront, symbolic regression searches for functions that fit the data. It constructs and evolves mathematical expressions using a given set of operators, dynamically combining them until it finds expressions that best balance accuracy and complexity.
A common approach is genetic programming, where functions are represented as expression trees. New candidate functions are created through mutation (changing a node) and crossover (swapping subtrees between two candidates):

You define the search space by choosing which operators to allow (e.g., +, *, exp, tanh). The search handles the rest without needing prior knowledge of the final functional form.
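The mutation and crossover operations above can be sketched with a toy example. Expressions are represented here as nested tuples (a minimal stand-in for expression trees; this illustrates the general genetic-programming idea, not the internals of SymbolFit or PySR):

```python
import random

# Allowed operators and their arities, defining the search space.
OPS = {"+": 2, "*": 2, "exp": 1, "tanh": 1}

def random_subtree(depth=2):
    """Grow a random expression tree from the allowed operators."""
    if depth == 0 or random.random() < 0.3:
        return random.choice(["x", round(random.uniform(-2, 2), 2)])
    op = random.choice(list(OPS))
    return (op,) + tuple(random_subtree(depth - 1) for _ in range(OPS[op]))

def nodes(tree, path=()):
    """Enumerate (path, subtree) pairs for every node in the tree."""
    yield path, tree
    if isinstance(tree, tuple):
        for i, child in enumerate(tree[1:], start=1):
            yield from nodes(child, path + (i,))

def replace(tree, path, new):
    """Return a copy of `tree` with the node at `path` replaced by `new`."""
    if not path:
        return new
    i = path[0]
    return tree[:i] + (replace(tree[i], path[1:], new),) + tree[i + 1:]

def mutate(tree):
    """Mutation: overwrite one randomly chosen node with a fresh subtree."""
    path, _ = random.choice(list(nodes(tree)))
    return replace(tree, path, random_subtree())

def crossover(a, b):
    """Crossover: graft a random subtree of `b` onto a random node of `a`."""
    path, _ = random.choice(list(nodes(a)))
    _, donor = random.choice(list(nodes(b)))
    return replace(a, path, donor)

random.seed(0)
parent1 = ("+", ("*", "x", 1.5), ("exp", "x"))   # x*1.5 + exp(x)
parent2 = ("tanh", ("+", "x", 0.5))              # tanh(x + 0.5)
child = crossover(mutate(parent1), parent2)
print(child)
```

In a full search, many such candidates would be evaluated against the data each generation, with the better-fitting ones kept as parents for the next round.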
The SymbolFit pipeline¶
SymbolFit wraps the full modeling workflow into a single automated pipeline:

Step 1: Function search (PySR)
SymbolFit interfaces with PySR, a high-performance symbolic regression library, to search for functional forms that fit the data. PySR returns a batch of candidate functions per run, ranging from simple to complex.
Step 2: Parameterization and re-optimization (LMFIT)
The initial candidates from PySR have hard-coded numerical constants that may not be fully optimized, and they lack uncertainty estimates. SymbolFit addresses this by:
- Identifying all numerical constants in each candidate and replacing them with named parameters (a1, a2, ...).
- Re-optimizing these parameters using LMFIT (nonlinear least-squares minimization), which refines the values and provides uncertainty estimates via covariance matrices.
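The re-optimization step can be illustrated with a minimal sketch. SymbolFit uses LMFIT for this; here `scipy.optimize.curve_fit` is used as a stand-in, since it performs the same nonlinear least-squares minimization and returns a covariance matrix from which parameter uncertainties follow. The candidate expression and its constants below are made up for illustration:

```python
import numpy as np
from scipy.optimize import curve_fit

# Suppose the search returned the candidate "3.1*exp(-0.52*x) + 0.09".
# Promote its hard-coded constants to named parameters a1, a2, a3:
def candidate(x, a1, a2, a3):
    return a1 * np.exp(-a2 * x) + a3

# Toy data generated from a known shape plus noise.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 100)
y = 3.0 * np.exp(-0.5 * x) + 0.1 + rng.normal(0, 0.02, x.size)

# Re-optimize starting from the search's constants; pcov is the
# covariance matrix, so the square roots of its diagonal entries
# give 1-sigma uncertainties on a1, a2, a3.
popt, pcov = curve_fit(candidate, x, y, p0=[3.1, 0.52, 0.09])
perr = np.sqrt(np.diag(pcov))
print(popt, perr)
```

The key point is that the constants found during the search serve only as starting values; the dedicated fit refines them and attaches uncertainties.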
Step 3: Evaluation and output
Every candidate function is automatically evaluated with standard goodness-of-fit metrics (χ²/NDF, p-value, RMSE, R²) and saved with full diagnostic information:
- CSV tables with functions, parameters, uncertainties, and scores
- PDF plots showing each candidate against data, uncertainty variations, sampling-based uncertainty bands, goodness-of-fit summaries, and parameter correlation matrices
All results are ready for downstream use without additional processing.
An example¶
Below is an example demonstrating that a single run of SymbolFit generates a variety of candidate functions, ranging from simple to complex, that can effectively fit a nontrivial distribution shape.

Introductory slides can also be found here.