
Toy dataset 2a (1D)

See the notebook for the complete procedure.

This fit generates 22 candidate functions in total. The output files can be found here (feel free to download them and look at what a typical fit produces).

Let's look at the output file candidates_reduced.csv, a CSV table storing all candidate functions and their evaluations:

```
Unnamed: 0 Parameterized equation, unscaled Parameters: (best-fit, +1, -1) Covariance Correlation RMSE R2 NDF Chi2 Chi2/NDF p-value
0 a1 {'a1': (2.05528, 0.0655, -0.0655)} {} {} 0.361 -0.04203 8 35.92 4.49 1.813e-05
1 a1 {'a1': (2.05528, 0.0655, -0.0655)} {} {} 0.361 -0.04203 8 35.92 4.49 1.813e-05
2 exp(a1**x0) {'a1': (0.902627, 0.0106, -0.0106)} {} {} 0.2347 0.5595 8 25.38 3.173 0.001339
3 a1*x0 + a2 {'a1': (-0.152, 0, 0), 'a2': (2.54597, 0.0532, -0.0532)} {} {} 0.2439 0.5245 8 23.7 2.962 0.002575
4 a1*exp(x0) + a2 {'a1': (-0.0062, 0, 0), 'a2': (2.25253, 0.043, -0.043)} {} {} 0.2749 0.3958 8 15.45 1.931 0.05101
5 a1x0*x0 + a2 {'a1': (-0.000627, 0, 0), 'a2': (2.14336, 0.0361, -0.0361)} {} {} 0.2955 0.3019 8 10.9 1.363 0.2073
6 a1exp(x0)*a3 + a2 {'a1': (-5.48e-07, 0, 0), 'a2': (2.12684, 0.0417, -0.0417), 'a3': (3.06639, 0.0545, -0.0545)} {'a2, a3': 0.0009089538934345906, 'a2, a2': 0.0017370330904972918, 'a3, a3': 0.0029662557027436004} {'a2, a3': 0.4} 0.3005 0.278 7 10.67 1.524 0.1537
7 a2 + tanh(a3x0*a1) {'a1': (-6.08353, 0.16, -0.16), 'a2': (1.12791, 0.0414, -0.0414), 'a3': (5000.0, 0, 0)} {'a1, a2': -0.003455724117867078, 'a1, a1': 0.025558616293521953, 'a2, a2': 0.001711506471301966} {'a1, a2': -0.5217} 0.2978 0.291 7 9.103 1.3 0.2453
8 a1x0x0 + a2*x0 + a3 {'a1': (-0.000611, 0, 0), 'a2': (0.139, 0, 0), 'a3': (2.13367, 0.0314, -0.0314)} {} {} 0.1838 0.7298 8 8.224 1.028 0.4119
9 a2x0x0 + a3 + tanh(x0)*a1 {'a1': (-0.722, 0, 0), 'a2': (-0.000601, 0, 0), 'a3': (1.12779, 0.0293, -0.0293)} {} {} 0.1027 0.9157 8 7.171 0.8963 0.5183
10 a1x0x0 + a2x0a4 + a3 {'a1': (-0.000611827, 0.000119, -0.000119), 'a2': (0.0138, 0, 0), 'a3': (2.13469, 0.0337, -0.0337), 'a4': (7.3, 0, 0)} {'a1, a3': -1.97765605821829e-06, 'a1, a1': 1.4078135307392676e-08, 'a3, a3': 0.00113567478364232} {'a1, a3': -0.4931} 0.07183 0.9587 7 6.276 0.8966 0.5079
11 a1exp(x0)a4 + a2(1/x0)**a4 + a3 {'a1': (-5.02e-07, 0, 0), 'a2': (0.106, 0, 0), 'a3': (2.11447, 0.0312, -0.0312), 'a4': (3.0801, 0.0417, -0.0417)} {'a3, a4': 0.0005179246074593511, 'a3, a3': 0.0009723101922214652, 'a4, a4': 0.001739255816631455} {'a3, a4': 0.3981} 0.0709 0.9598 7 5.985 0.855 0.5415
12 a1exp(x0)a4 + a3 + (a2/x0)*exp(x0) {'a1': (-5.02e-07, 0, 0), 'a2': (0.458, 0, 0), 'a3': (2.1187, 0.0308, -0.0308), 'a4': (3.08168, 0.041, -0.041)} {'a3, a4': 0.0005038357787305179, 'a3, a3': 0.0009468645821696876, 'a4, a4': 0.0016801026110262708} {'a3, a4': 0.399} 0.0694 0.9615 7 5.822 0.8317 0.5607
13 a1exp(x0)a4 + a3 + (a2/x0)(a4x0) {'a1': (-5.02e-07, 0, 0), 'a2': (0.461, 0, 0), 'a3': (2.11896, 0.0307, -0.0307), 'a4': (3.08183, 0.0409, -0.0409)} {'a3, a4': 0.0005045971504331848, 'a3, a3': 0.000945445358799716, 'a4, a4': 0.0016749950329208738} {'a3, a4': 0.4019} 0.06817 0.9628 7 5.805 0.8292 0.5627
14 a1exp(x0) + a2x0**2 + a3/x0 + tanh(x0) {'a1': (-0.0277038, 0.00251, -0.00251), 'a2': (0.136813, 0.00958, -0.00958), 'a3': (1.34298, 0.101, -0.101)} {'a1, a2': -2.297757180249665e-05, 'a1, a3': 0.0001261412440401108, 'a2, a3': -0.0006393350924087656, 'a1, a1': 6.279777411879649e-06, 'a2, a2': 9.186158351658068e-05, 'a3, a3': 0.010181491486377781} {'a1, a2': -0.9556, 'a1, a3': 0.4976, 'a2, a3': -0.6608} 0.07962 0.9493 6 3.182 0.5303 0.7857
15 a1exp(x0) + a2x0*2 + a3exp(-x0) + tanh(x0) {'a1': (-0.033981, 0.0023, -0.0023), 'a2': (0.176283, 0.00791, -0.00791), 'a3': (4.11033, 0.308, -0.308)} {'a1, a2': -1.75558643003482e-05, 'a1, a3': 0.00023867130735472176, 'a2, a3': -0.0010333653548920232, 'a1, a1': 5.285502177782489e-06, 'a2, a2': 6.263051054791136e-05, 'a3, a3': 0.09468084737369323} {'a1, a2': -0.965, 'a1, a3': 0.3369, 'a2, a3': -0.4242} 0.08057 0.9481 6 3.16 0.5266 0.7886
16 a1exp(x0) + a2x02 + a3/x0 + tanh(x0a4) {'a1': (-0.0271627, 0.00237, -0.00237), 'a2': (0.134446, 0.00906, -0.00906), 'a3': (1.3585, 0.0954, -0.0954), 'a4': (1.76, 0, 0)} {'a1, a2': -2.0533780214173956e-05, 'a1, a3': 0.00011272542649262147, 'a2, a3': -0.0005713382764988577, 'a1, a1': 5.611888422300035e-06, 'a2, a2': 8.209159706999988e-05, 'a3, a3': 0.009098633670575626} {'a1, a2': -0.9563, 'a1, a3': 0.4986, 'a2, a3': -0.661} 0.05072 0.9794 6 2.844 0.4739 0.8282
17 a1exp(x0) + a2x02 + a4/x0 + tanh(x0(a3 + x0)) {'a1': (-0.0273534, 0.00235, -0.00235), 'a2': (0.135433, 0.00897, -0.00897), 'a3': (1.07, 0, 0), 'a4': (1.34171, 0.0944, -0.0944)} {'a1, a2': -2.0126210919177028e-05, 'a1, a4': 0.00011048797083236222, 'a2, a4': -0.0005599979418291376, 'a1, a1': 5.5004996091764936e-06, 'a2, a2': 8.046218378024949e-05, 'a4, a4': 0.008918037419015503} {'a1, a2': -0.9548, 'a1, a4': 0.4981, 'a2, a4': -0.6613} 0.05042 0.9797 6 2.787 0.4645 0.835
18 a2exp(x0) + a4 + a5/x0 + tanh(x0) + tanh(a3x0*(a1 + x0)) {'a1': (-1.94, 0, 0), 'a2': (-0.0121, 0, 0), 'a3': (0.174755, 0.0289, -0.0289), 'a4': (0.471884, 0.0921, -0.0921), 'a5': (1.16928, 0.143, -0.143)} {'a3, a4': -0.002300578761032394, 'a3, a5': 0.0021352199001804274, 'a4, a5': -0.010982581385669675, 'a3, a3': 0.000834193085319076, 'a4, a4': 0.008479022052957617, 'a5, a5': 0.020560132183072337} {'a3, a4': -0.8643, 'a3, a5': 0.5167, 'a4, a5': -0.8339} 0.07573 0.9541 6 2.604 0.4341 0.8566
19 a2exp(x0) + a4 + a5/x0 + tanh(x0) + tanh(a3x0*(a1 + x0)) {'a1': (-2.33, 0, 0), 'a2': (-0.0148236, 0.00105, -0.00105), 'a3': (0.176639, 0.0291, -0.0291), 'a4': (0.739794, 0.113, -0.113), 'a5': (1.00747, 0.178, -0.178)} {'a2, a3': 2.571387859970431e-06, 'a2, a4': -6.599195891650885e-05, 'a2, a5': 7.938840593704372e-05, 'a3, a4': -0.0025163923280050664, 'a3, a5': 0.0030927690268355767, 'a4, a5': -0.01824821711602555, 'a2, a2': 1.099476773081912e-06, 'a3, a3': 0.0008470311174969069, 'a4, a4': 0.012823335794655933, 'a5, a5': 0.031541049354375975} {'a2, a3': 0.08416, 'a2, a4': -0.5562, 'a2, a5': 0.4248, 'a3, a4': -0.7653, 'a3, a5': 0.5971, 'a4, a5': -0.9072} 0.0551 0.9757 5 2.266 0.4532 0.8112
20 a2exp(x0) + a4 + a5/x0 + tanh(x02) + tanh(a3x0*(a1 + x0)) {'a1': (-2.01054, 0.188, -0.188), 'a2': (-0.0121, 0, 0), 'a3': (0.193281, 0.0358, -0.0358), 'a4': (0.421164, 0.0829, -0.0829), 'a5': (1.27121, 0.152, -0.152)} {'a1, a3': -0.004070676171719144, 'a1, a4': -2.6608463209886575e-05, 'a1, a5': -0.01281056135474584, 'a3, a4': -0.001982765933058178, 'a3, a5': 0.0033640618921915097, 'a4, a5': -0.009376396445189625, 'a1, a1': 0.03523722733305552, 'a3, a3': 0.001279963729104309, 'a4, a4': 0.006866933230097807, 'a5, a5': 0.02316194919935461} {'a1, a3': -0.6048, 'a1, a4': -0.001707, 'a1, a5': -0.4483, 'a3, a4': -0.6681, 'a3, a5': 0.6182, 'a4, a5': -0.7441} 0.05252 0.9779 5 2.03 0.4061 0.8449
21 a2exp(x0) + a4 + a5/x0 + tanh(x0(2x0)) + tanh(a3x0(a1 + x0)) {'a1': (-2.33, 0, 0), 'a2': (-0.0145878, 0.00101, -0.00101), 'a3': (0.190041, 0.0286, -0.0286), 'a4': (0.695267, 0.109, -0.109), 'a5': (1.03399, 0.169, -0.169)} {'a2, a3': 5.5374989952583414e-06, 'a2, a4': -6.842481228318895e-05, 'a2, a5': 8.235783476269107e-05, 'a3, a4': -0.002427940863820981, 'a3, a5': 0.002953379843732268, 'a4, a5': -0.01682770436697607, 'a2, a2': 1.0105467737819519e-06, 'a3, a3': 0.000819363289913555, 'a4, a4': 0.011961320256874186, 'a5, a5': 0.028694971571777278} {'a2, a3': 0.1917, 'a2, a4': -0.6215, 'a2, a5': 0.4825, 'a3, a4': -0.7788, 'a3, a5': 0.611, 'a4, a5': -0.9135} 0.04895 0.9808 5 2.02 0.404 0.8464
```
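The table is easiest to explore programmatically. The parameter, covariance, and correlation columns are Python dictionaries stored as strings, so they can be parsed back with ast.literal_eval. A minimal sketch using only the standard library (the column layout is taken from the table above; the small inline two-row excerpt stands in for the real candidates_reduced.csv, which you would read from disk instead):

```python
import ast
import csv
import io

# A two-row excerpt in the same shape as candidates_reduced.csv
# (in practice, point csv.DictReader or pandas.read_csv at the real file).
csv_text = '''Unnamed: 0,"Parameterized equation, unscaled","Parameters: (best-fit, +1, -1)",Chi2,NDF
0,a1,"{'a1': (2.05528, 0.0655, -0.0655)}",35.92,8
3,a1*x0 + a2,"{'a1': (-0.152, 0, 0), 'a2': (2.54597, 0.0532, -0.0532)}",23.7,8
'''

rows = list(csv.DictReader(io.StringIO(csv_text)))

# Parse the string-encoded dicts back into dicts of (best-fit, +1, -1) tuples.
params = {int(r["Unnamed: 0"]): ast.literal_eval(r["Parameters: (best-fit, +1, -1)"])
          for r in rows}

print(params[3]["a2"])  # (2.54597, 0.0532, -0.0532)
```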

The goodness-of-fit scores are plotted in candidates_gof.pdf, for example the chi2/NDF:

image

For other goodness-of-fit scores:


image

^ p-value

image

^ Root-mean-square error

image

^ Coefficient of determination R2
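The Chi2, NDF, and p-value columns are consistent with one another: the p-value is the upper-tail probability of a chi-squared distribution with NDF degrees of freedom, which scipy.stats.chi2.sf computes directly. As a self-contained check, the sketch below uses the closed form valid for even NDF and reproduces the first row of the table (Chi2 = 35.92, NDF = 8):

```python
import math

def chi2_sf_even(x, ndf):
    """Upper-tail probability P(X > x) for a chi-squared distribution,
    using the closed form valid for even ndf:
    exp(-x/2) * sum_{i < ndf/2} (x/2)^i / i!."""
    if ndf % 2 != 0:
        raise ValueError("closed form only valid for even ndf")
    half = x / 2.0
    return math.exp(-half) * sum(half**i / math.factorial(i) for i in range(ndf // 2))

# First row of the table: Chi2 = 35.92 with NDF = 8
p = chi2_sf_even(35.92, 8)
print(f"{p:.3e}")  # ~1.8e-05, matching the table's p-value of 1.813e-05 up to rounding of Chi2
```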

Now, let's take a look at one of the candidate functions, say candidate #20. Its functional form can be found in the corresponding plots in the PDF files and in the CSV table above:

a2*exp(x0) + a4 + a5/x0 + tanh(x0**2) + tanh(a3*x0*(a1 + x0)).

Unlike the previous example (toy1), here we have set input_rescale = False and scale_y_by = None when configuring the fits, since the x and y of this dataset are already O(1) and there is no need to scale them to prevent numerical overflow (you still can if you want). The functions here therefore appear slightly cleaner: no overall normalization (rescaling y by c*(...)) and no un-standardization (rescaling x -> c*(x-b)).
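As a quick illustration (independent of this package) of why rescaling matters when the data are not O(1): building blocks like exp(x0) overflow double precision once the argument exceeds roughly 709, whereas O(1) inputs are safe.

```python
import math

# exp overflows double precision for arguments above ~709.78
print(math.exp(2.0))       # O(1) input: fine

try:
    math.exp(1000.0)       # large unscaled input: overflows
except OverflowError as err:
    print("overflow:", err)

# Rescaling x -> x / 1000 keeps intermediate values representable
print(math.exp(1000.0 / 1000.0))
```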

This candidate function originally has 5 parameters: a1, a2, a3, a4, a5. However, only 4 of them vary in the final fit: a1, a3, a4, a5, as can be seen from the Parameters: (best-fit, +1, -1) column in the CSV table or directly from the PDF files:

{'a1': (-2.01054, 0.188, -0.188), 'a2': (-0.0121, 0, 0), 'a3': (0.193281, 0.0358, -0.0358), 'a4': (0.421164, 0.0829, -0.0829), 'a5': (1.27121, 0.152, -0.152)}

where a2 has zeros in both the +1 and -1 uncertainty entries, meaning this parameter was held fixed during the re-optimization. When the objective function is too complex to minimize reliably, some parameters are held fixed in the re-optimization loop to reduce the number of free parameters, so that the remaining ones can be fitted stably. This is common when the functions or the distribution shapes are not very simple.
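Fixed parameters can therefore be identified mechanically: any entry whose +1 and -1 uncertainties are both zero was held constant. A sketch applying this rule to candidate #20's parameter dictionary from above:

```python
# Parameter dictionary of candidate #20, copied from the table above:
# name -> (best-fit, +1 sigma, -1 sigma)
params = {
    'a1': (-2.01054, 0.188, -0.188),
    'a2': (-0.0121, 0, 0),
    'a3': (0.193281, 0.0358, -0.0358),
    'a4': (0.421164, 0.0829, -0.0829),
    'a5': (1.27121, 0.152, -0.152),
}

# A parameter was held fixed if both uncertainty entries are exactly zero.
fixed = [name for name, (best, up, down) in params.items() if up == 0 and down == 0]
varying = [name for name in params if name not in fixed]

print(fixed)    # ['a2']
print(varying)  # ['a1', 'a3', 'a4', 'a5']
```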

To see how this candidate function behaves when each of these 4 varying parameters is shifted to its +/-1 sigma value:


image

^ +/-1 sigma variations of parameter a1

image

^ +/-1 sigma variations of parameter a3

image

^ +/-1 sigma variations of parameter a4

image

^ +/-1 sigma variations of parameter a5

image

^ Correlation matrix
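The +/-1 sigma variation curves above can be reproduced by re-evaluating the candidate function with one parameter shifted at a time while the others stay at their best-fit values. A minimal sketch for candidate #20 at an illustrative point x0 = 1.5 (note that such one-at-a-time shifts ignore parameter correlations, which is exactly what the sampling in candidates_sampling.pdf addresses):

```python
import math

def f20(x0, a1, a2, a3, a4, a5):
    """Candidate #20: a2*exp(x0) + a4 + a5/x0 + tanh(x0**2) + tanh(a3*x0*(a1 + x0))."""
    return (a2 * math.exp(x0) + a4 + a5 / x0
            + math.tanh(x0**2) + math.tanh(a3 * x0 * (a1 + x0)))

# Best-fit values and the +/-1 sigma uncertainty of a1, from the table above
best = {'a1': -2.01054, 'a2': -0.0121, 'a3': 0.193281, 'a4': 0.421164, 'a5': 1.27121}
sigma_a1 = 0.188

x = 1.5
nominal = f20(x, **best)
up = f20(x, **{**best, 'a1': best['a1'] + sigma_a1})
down = f20(x, **{**best, 'a1': best['a1'] - sigma_a1})
print(down, nominal, up)
```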

As the correlation matrix shows, these parameters are not all independent of each other, so it is useful to look at the actual uncertainty coverage when the uncertainties of all parameters in a candidate function are taken into account. This is plotted in candidates_sampling.pdf. For each candidate function, an ensemble of functions is generated by sampling its parameters from a multivariate normal distribution, with the best-fit parameter values as the mean and the fitted covariance matrix as the covariance. In this way, the total uncertainty accounts for all parameters, and their correlations, simultaneously. The 68% quantile range of this function ensemble is then drawn as a green band in the plots and compared with the input data.
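This sampling procedure can be sketched with numpy, using candidate #20's varying parameters and their covariance matrix from the table above (the x grid here is only illustrative; the actual data range is in the notebook):

```python
import numpy as np

# Varying parameters of candidate #20 (a2 was held fixed) and their
# covariance matrix, both copied from the table above.
names = ['a1', 'a3', 'a4', 'a5']
mean = np.array([-2.01054, 0.193281, 0.421164, 1.27121])
cov = np.array([
    [ 0.03523722733305552,   -0.004070676171719144, -2.6608463209886575e-05, -0.01281056135474584 ],
    [-0.004070676171719144,   0.001279963729104309, -0.001982765933058178,    0.0033640618921915097],
    [-2.6608463209886575e-05, -0.001982765933058178, 0.006866933230097807,   -0.009376396445189625 ],
    [-0.01281056135474584,    0.0033640618921915097, -0.009376396445189625,   0.02316194919935461  ],
])
a2_fixed = -0.0121

def f20(x0, a1, a3, a4, a5, a2=a2_fixed):
    # a2*exp(x0) + a4 + a5/x0 + tanh(x0**2) + tanh(a3*x0*(a1 + x0))
    return (a2 * np.exp(x0) + a4 + a5 / x0
            + np.tanh(x0**2) + np.tanh(a3 * x0 * (a1 + x0)))

# Sample parameter vectors from the multivariate normal distribution
rng = np.random.default_rng(0)
samples = rng.multivariate_normal(mean, cov, size=2000)   # shape (2000, 4)

# Evaluate the function ensemble over an illustrative x grid (avoiding x0 = 0)
x = np.linspace(0.5, 5.0, 50)
ensemble = np.array([f20(x, *s) for s in samples])        # shape (2000, 50)

# The 68% quantile band of the ensemble corresponds to the green band in the plots
lo, hi = np.quantile(ensemble, [0.16, 0.84], axis=0)
central = f20(x, *mean)
```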

image

Note that the 95% quantile range can also be added by setting sampling_95quantile = True.