# Case Study: Correlations and SVD Cuts¶

This case study illustrates a problem that arises when constructing correlation matrices from a small number of random samples. It shows how to fix the problem using an SVD cut.

## The Problem¶

We want to determine the slope indicated by measurements of a quantity

```y(x[i]) = y0 + s * x[i]
```

for `x=[1,2...10]`. The measurements are noisy so we average 13 sets `y_sample[j]` of independent measurements:

```import numpy as np
import gvar as gv

x = np.array([1., 2., 3., 4., 5., 6., 7., 8., 9., 10.])
y_samples = [
[2.8409,  4.8393,  6.8403,  8.8377, 10.8356, 12.8389, 14.8356, 16.8362, 18.8351, 20.8341],
[2.8639,  4.8612,  6.8597,  8.8559, 10.8537, 12.8525, 14.8498, 16.8487, 18.8460, 20.8447],
[3.1048,  5.1072,  7.1071,  9.1076, 11.1090, 13.1107, 15.1113, 17.1134, 19.1145, 21.1163],
[3.0710,  5.0696,  7.0708,  9.0705, 11.0694, 13.0681, 15.0693, 17.0695, 19.0667, 21.0678],
[3.0241,  5.0223,  7.0198,  9.0204, 11.0191, 13.0193, 15.0198, 17.0163, 19.0154, 21.0155],
[2.9719,  4.9700,  6.9709,  8.9706, 10.9707, 12.9705, 14.9699, 16.9686, 18.9676, 20.9686],
[3.0688,  5.0709,  7.0724,  9.0730, 11.0749, 13.0776, 15.0790, 17.0800, 19.0794, 21.0795],
[3.1471,  5.1468,  7.1452,  9.1451, 11.1429, 13.1445, 15.1450, 17.1435, 19.1425, 21.1432],
[3.0233,  5.0233,  7.0225,  9.0224, 11.0225, 13.0216, 15.0224, 17.0217, 19.0208, 21.0222],
[2.8797,  4.8792,  6.8803,  8.8794, 10.8800, 12.8797, 14.8801, 16.8797, 18.8803, 20.8812],
[3.0388,  5.0407,  7.0409,  9.0439, 11.0443, 13.0459, 15.0455, 17.0479, 19.0493, 21.0505],
[3.1353,  5.1368,  7.1376,  9.1367, 11.1360, 13.1377, 15.1369, 17.1400, 19.1384, 21.1396],
[3.0051,  5.0063,  7.0022,  9.0052, 11.0040, 13.0033, 15.0007, 16.9989, 18.9994, 20.9995],
]
y = gv.dataset.avg_data(y_samples)
```

The result is an array of 10 `gvar.GVar`s

```>>> print(y)
[3.013(27) 5.013(27) 7.013(27) 9.013(27) 11.012(27) 13.013(27) 15.013(28)
17.013(28) 19.012(28) 21.013(28)]
```

that are highly correlated:

```>>> print(gv.evalcorr(y)[:4,:4])
[[1.         0.99990406 0.99973156 0.99959261]
[0.99990406 1.         0.99985848 0.99982468]
[0.99973156 0.99985848 1.         0.99987618]
[0.99959261 0.99982468 0.99987618 1.        ]]
```

To extract a slope we fit these data using the `lsqfit` module:

```import lsqfit

def fcn(p):
return p['y0'] + p['s'] * x

prior = gv.gvar(dict(y0='0(5)', s='0(5)'))
fit = lsqfit.nonlinear_fit(data=y, fcn=fcn, prior=prior)
print(fit)
```

The fit, however, is very poor, with a `chi**2` per degree of freedom of 8.3:

```Least Square Fit:
chi2/dof [dof] = 8.3     Q = 1.1e-13    logGBF = 11.816

Parameters:
y0     0.963 (12)      [  0.0 (5.0) ]
s   2.00078 (18)      [  0.0 (5.0) ]

Settings:
svdcut/n = 1e-12/0    tol = (1e-08*,1e-10,1e-10)    (itns/time = 5/0.0)
```

The problem is that we do not have enough samples in `y_sample` to determine the correlation matrix sufficiently accurately. The smallest eigenvalues of the correlation matrix tend to be underestimated with small samples. Indeed the smallest eigenvalues go to zero when the sample size is smaller than the dimension of `y` (i.e., 10 here). The underestimated eigenvalues result in contributions to the `chi**2` function in the fit that are both spurious and large.

## A Poor Solution¶

One solution is to declare the correlations unreliable and to discard them, keeping just the individual standard deviations:

```y = gv.gvar(gv.mean(y), gv.sdev(y))

fit = lsqfit.nonlinear_fit(data=y, fcn=fcn, prior=prior)
print(fit)
```

This gives an acceptable fit,

```Least Square Fit:
chi2/dof [dof] = 0.02     Q = 1    logGBF = 12.924

Parameters:
y0    1.013 (18)     [  0.0 (5.0) ]
s   1.9999 (30)     [  0.0 (5.0) ]

Settings:
svdcut/n = 1e-12/0    tol = (1e-08*,1e-10,1e-10)    (itns/time = 5/0.0)
```

but the very small `chi**2` confirms what we suspect: that we are ignoring very strong correlations that are relevant to the fit. Not surprisingly, the accuracy of our slope determination is quite sensitive to these correlations.

## A Better Solution¶

A better solution is to determine which of the correlation matrix’s eigenvalues are accurate and retain those in the fit. We do this with `gvar.dataset.svd_diagnosis()` which uses a bootstrap analysis to investigate the accuracy and stability of the eigenvalues. Adding the code

```svd = gv.dataset.svd_diagnosis(y_samples)
svd.plot_ratio(show=True)
```

displays a plot showing the ratio of the bootstrap estimate for each eigenvalue divided by the real eigenvalue: The bootstrap tests the stability of eigenvalues with limited sample sizes. Bootstrap estimates that are significantly lower than the real values indicate eigenvalues that are likely unreliable. Here bootstrap eigenvalues agree well with the real values for the upper half of the spectrum, but are all low for the lower half. The standard deviation for the chi-squared per degree of freedom is indicated by the dotted (bottom) line in the plot; the SVD cut is chosen so that (most) eigenvalues that fall below this line are modified. The bootstrap errors give a sense for how accurately the underlying eigenvalues are determined given the sample size.

The plot shows that the fitting problem lies with the eigenvalues that are smaller than roughly 10-5 times the largest eigenvalue. To address this problem we introduce an SVD cut using `gvar.svd()` with a value for `svdcut` suggested by `gvar.dataset.svd_diagnosis()` (dotted red line in the figure):

```y = gv.svd(y, svdcut=svd.svdcut)

fit = lsqfit.nonlinear_fit(data=y, fcn=fcn, prior=prior)
print(fit)
```

`gv.svd(y, svdcut=svd.svdcut)` creates a new version of the data `y` with a correlation matrix whose large eigenvalues are unchanged but whose small eigenvalues, below `svdcut*max_eig`, are all set equal to `svdcut*max_eig` (where `max_eig` is the largest eigenvalue). This probably overestimates the uncertainties associated with the small eigenvalues, and so is a conservative move. It makes the correlation matrix less singular, and fixes the fit:

```Least Square Fit:
chi2/dof [dof] = 0.9     Q = 0.53    logGBF = 45.208

Parameters:
y0     1.008 (22)      [  0.0 (5.0) ]
s   2.00001 (22)      [  0.0 (5.0) ]

Settings:
svdcut/n = 1e-12/0    tol = (1e-08*,1e-10,1e-10)    (itns/time = 5/0.0)
```

Our final estimate for the slope is `s = 2.00001(22)`, whose uncertainty is more than an order-of-magnitude smaller than what we obtained from the uncorrelated fit.

This simple problem can be approached in different ways. For example, we could estimate the slope from `y[i+1] - y[i]`, doing a weighted average over all values of `i`:

```slope = lsqfit.wavg(y[1:] - y[:-1])
print(slope)
```

This again gives a slope of `2.00001(22)` provided an SVD cut has first been applied to `y`.

SVD cuts are often necessary when using correlation matrices constructed from random samples. Typically large numbers of samples are needed to calculate all of a correlation matrix’s eigenvalues accurately — 10–100 times as many samples as there are variables, or more. Such large numbers of samples are often not feasible, in which case an SVD cut might be essential for a usable correlation matrix.