# `gvar.dataset` - Random Data Sets¶

## Introduction¶

`gvar.dataset` contains several tools for collecting and analyzing random samples from arbitrary distributions. The random samples are represented by lists of numbers or arrays, where each number/array is a new sample from the underlying distribution. For example, six samples from a one-dimensional gaussian distribution, 1±1, might look like

```
>>> random_numbers = [1.739, 2.682, 2.493, -0.460, 0.603, 0.800]
```

while six samples from a two-dimensional distribution, [1±1, 2±1], might be

```
>>> random_arrays = [[ 0.494, 2.734], [ 0.172, 1.400], [ 1.571, 1.304],
...                  [ 1.532, 1.510], [ 0.669, 0.873], [ 1.242, 2.188]]
```

Samples from more complicated multidimensional distributions are represented by dictionaries whose values are lists of numbers or arrays: for example,

```
>>> random_dict = dict(n=random_numbers, a=random_arrays)
```

where list elements `random_dict['n'][i]` and `random_dict['a'][i]` are part of the same multidimensional sample for every `i` — that is, the lists for different keys in the dictionary are synchronized with one another.
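
Concretely, the `i`-th full sample combines the `i`-th entry from every key. A minimal sketch in plain Python (no `gvar` required), using the samples above:

```python
random_numbers = [1.739, 2.682, 2.493, -0.460, 0.603, 0.800]
random_arrays = [[0.494, 2.734], [0.172, 1.400], [1.571, 1.304],
                 [1.532, 1.510], [0.669, 0.873], [1.242, 2.188]]
random_dict = dict(n=random_numbers, a=random_arrays)

# because the lists are synchronized, the i-th multidimensional sample
# pairs the i-th entry from every key
i = 2
sample = {k: v[i] for k, v in random_dict.items()}
print(sample)   # {'n': 2.493, 'a': [1.571, 1.304]}
```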

With large samples, we typically want to estimate the mean value of the underlying distribution. This is done using `gvar.dataset.avg_data()`: for example,

```
>>> print(avg_data(random_numbers))
1.31(45)
```

indicates that `1.31(45)` is our best guess, based only upon the samples in `random_numbers`, for the mean of the distribution from which those samples were drawn. Similarly
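
The `1.31(45)` can be reproduced by hand: the central value is the sample mean, and the error is the standard error on the mean, `sqrt(var/N)`. The following is a plain-Python sketch of that arithmetic (assuming a `1/N` variance normalization), not the library's implementation:

```python
import math

def mean_and_error(samples):
    # sample mean, with the standard error sqrt(var/N) as the uncertainty
    n = len(samples)
    mean = sum(samples) / n
    var = sum((x - mean) ** 2 for x in samples) / n   # 1/N normalization
    return mean, math.sqrt(var / n)

random_numbers = [1.739, 2.682, 2.493, -0.460, 0.603, 0.800]
m, err = mean_and_error(random_numbers)
print(f"{m:.2f}({err * 100:.0f})")   # 1.31(45)
```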

```
>>> print(avg_data(random_arrays))
[0.95(22) 1.67(25)]
```

indicates that the means for the two-dimensional distribution behind `random_arrays` are `[0.95(22), 1.67(25)]`. `avg_data()` can also be applied to a dictionary whose values are lists of numbers/arrays: for example,

```
>>> print(avg_data(random_dict))
{'a': array([0.95(22), 1.67(25)], dtype=object), 'n': 1.31(45)}
```

Class `gvar.dataset.Dataset` can be used to assemble dictionaries containing random samples. For example, imagine that the random samples above were originally written into a file, as they were generated:

```
# file: datafile
n 1.739
a [ 0.494, 2.734]
n 2.682
a [ 0.172, 1.400]
n 2.493
a [ 1.571, 1.304]
n -0.460
a [ 1.532, 1.510]
n 0.603
a [ 0.669, 0.873]
n 0.800
a [ 1.242, 2.188]
```

Here each line is a different random sample, either from the one-dimensional distribution (labeled `n`) or from the two-dimensional distribution (labeled `a`). Assuming the file is called `datafile`, this data can be read into a dictionary, essentially identical to the `random_dict` dictionary above, using:

```
>>> data = Dataset("datafile")
>>> print(data['a'])
[array([ 0.494, 2.734]), array([ 0.172, 1.400]), array([ 1.571, 1.304]) ... ]
>>> print(avg_data(data['n']))
1.31(45)
```

The brackets and commas can be omitted in the input file for one-dimensional arrays: for example, `datafile` (above) could equivalently be written

```
# file: datafile
n 1.739
a 0.494 2.734
n 2.682
a 0.172 1.400
...
```

Other data formats may also be easy to use. For example, a data file written using `yaml` would look like

```
# file: datafile
---
n: 1.739
a: [ 0.494, 2.734]
---
n: 2.682
a: [ 0.172, 1.400]
.
.
.
```

and could be read into a `gvar.dataset.Dataset` using:

```
import yaml

data = Dataset()
with open("datafile", "r") as dfile:
    for d in yaml.safe_load_all(dfile):   # each d is a dictionary
        data.append(d)
```

Finally note that data can be binned, into bins of size `binsize`, using `gvar.dataset.bin_data()`. For example, `gvar.dataset.bin_data(data, binsize=3)` replaces every three samples in `data` by the average of those samples. This creates a dataset that is `1/3` the size of the original but has the same mean. Binning is useful for making large datasets more manageable, and also for removing sample-to-sample correlations. Over-binning, however, erases statistical information.
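
The binning semantics can be sketched in plain Python; this is a hypothetical stand-in for `gvar.dataset.bin_data()`, shown only to illustrate the idea:

```python
def bin_data(samples, binsize):
    # average consecutive groups of binsize samples; drop any incomplete bin
    nbins = len(samples) // binsize
    return [sum(samples[i * binsize:(i + 1) * binsize]) / binsize
            for i in range(nbins)]

random_numbers = [1.739, 2.682, 2.493, -0.460, 0.603, 0.800]
binned = bin_data(random_numbers, binsize=3)
print(binned)   # two bins, each the average of three consecutive samples
# the overall mean is unchanged by binning
print(sum(binned) / len(binned), sum(random_numbers) / len(random_numbers))
```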

Class `gvar.dataset.Dataset` can also be used to build a dataset sample by sample in code: for example,

```
>>> data = Dataset()
>>> data.append(n=1.739, a=[ 0.494, 2.734])
>>> data.append(n=2.682, a=[ 0.172, 1.400])
...
```

creates the same dataset as above.

## Functions¶

The functions defined in the module are:

gvar.dataset.avg_data(dataset, spread=False, median=False, bstrap=False, noerror=False, mismatch='truncate', warn=False)

Average `dataset` to estimate means and covariance.

`dataset` is a list of random numbers, a list of random arrays, or a dictionary of lists of random numbers and/or arrays: for example,

```
>>> random_numbers = [1.60, 0.99, 1.28, 1.30, 0.54, 2.15]
>>> random_arrays = [[12.2,121.3],[13.4,149.2],[11.7,135.3],
...                  [7.2,64.6],[15.2,69.0],[8.3,108.3]]
>>> random_dict = dict(n=random_numbers,a=random_arrays)
```

where in each case there are six random numbers/arrays. `avg_data` estimates the means of the distributions from which the random numbers/arrays are drawn, together with the uncertainties in those estimates. The results are returned as a `gvar.GVar` or an array of `gvar.GVar`s, or a dictionary of `gvar.GVar`s and/or arrays of `gvar.GVar`s:

```
>>> print(avg_data(random_numbers))
1.31(20)
>>> print(avg_data(random_arrays))
[11.3(1.1) 108(13)]
>>> print(avg_data(random_dict))
{'a': array([11.3(1.1), 108(13)], dtype=object), 'n': 1.31(20)}
```

The arrays in `random_arrays` are one dimensional; in general, they can have any shape.

`avg_data(dataset)` also estimates any correlations between different quantities in `dataset`. When `dataset` is a dictionary, it does this by assuming that the lists of random numbers/arrays for the different `dataset[k]`s are synchronized, with the first element in one list corresponding to the first elements in all other lists, and so on.

Note that estimates of the correlations are robust only if the number of samples being averaged is substantially larger (e.g., 10x) than the number of quantities being averaged. The correlation matrix is poorly conditioned or singular if the number of samples is too small. Function `gvar.dataset.svd_diagnosis()` can be used to determine whether there is a problem and, if so, the problem can be ameliorated by applying an SVD cut to the data after averaging (`gvar.regulate()`).

There are several optional arguments. If argument `spread=True`, each standard deviation in the results refers to the spread in the data, not the uncertainty in the estimate of the mean. The former is `sqrt(N)` larger, where `N` is the number of random numbers (or arrays) being averaged:

```
>>> print(avg_data(random_numbers,spread=True))
1.31(50)
>>> print(avg_data(random_numbers))
1.31(20)
>>> print((0.50 / 0.20) ** 2)   # should be (about) 6
6.25
```

This is useful, for example, when averaging bootstrap data. The default value is `spread=False`.
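
The relationship between the two error estimates can be checked directly. This plain-Python sketch mimics (rather than calls) `avg_data`, assuming a `1/N` variance normalization, and reproduces both numbers from the example above:

```python
import math

def avg(samples, spread=False):
    # mean, with either the spread (sigma) or the standard error sigma/sqrt(N)
    n = len(samples)
    mean = sum(samples) / n
    sigma = math.sqrt(sum((x - mean) ** 2 for x in samples) / n)
    return mean, sigma if spread else sigma / math.sqrt(n)

random_numbers = [1.60, 0.99, 1.28, 1.30, 0.54, 2.15]
m, err = avg(random_numbers)                 # approximately 1.31, 0.20
m, sig = avg(random_numbers, spread=True)    # approximately 1.31, 0.50
print(sig / err, math.sqrt(6))               # the ratio is sqrt(N)
```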

The second option is triggered by setting `median=True`. This replaces the means in the results by medians, while the standard deviations are approximated by the width of the larger interval above or below the median that contains 34% of the data. These estimates are more robust than the mean and standard deviation when averaging over small amounts of data; in particular, they are unaffected by extreme outliers in the data. The default is `median=False`.

The third option is triggered by setting `bstrap=True`. This is shorthand for setting `median=True` and `spread=True`, and overrides any explicit setting for these keyword arguments. This is the typical choice for analyzing bootstrap data — hence its name. The default value is `bstrap=False`.

The fourth option is to omit the error estimates on the averages, which is triggered by setting `noerror=True`. Just the mean values are returned. The default value is `noerror=False`.

The fifth option, `mismatch`, is only relevant when `dataset` is a dictionary whose entries `dataset[k]` have different sample sizes. This complicates the calculation of correlations between the different entries. There are three choices for `mismatch`:

`mismatch='truncate'`: Samples are discarded from the ends of entries so that all entries have the same sample size (equal to the smallest sample size). This is the default.

`mismatch='wavg'`: The data set is decomposed into a collection of data sets so that the entries in any given data set all have the same sample size; no samples are discarded. Each of the resulting data sets is averaged separately, and the final result is the weighted average of these averages. This choice is the most accurate but also the slowest, especially for large problems. It should only be used when the smallest sample size is much larger (e.g., 10x) than the number of quantities being averaged. It also requires the `lsqfit` Python module.

`mismatch='decorrelate'`: Ignores correlations between different entries `dataset[k]`. All samples are used and correlations within an entry `dataset[k]` are retained. This is the fastest choice.

The final option `warn` determines whether or not a warning is issued when different entries in a dictionary data set have different sample sizes. The default is `warn=False`.

gvar.dataset.autocorr(dataset)

Compute autocorrelation in `dataset`.

`dataset` is a list of random numbers or random arrays, or a dictionary of lists of random numbers/arrays.

When `dataset` is a list of random numbers, `autocorr(dataset)` returns an array where `autocorr(dataset)[i]` is the correlation between elements in `dataset` that are separated by distance `i` in the list: for example,

```
>>> print(autocorr([2,-2,2,-2,2,-2]))
[ 1. -1.  1. -1.  1. -1.]
```

shows perfect correlation between elements separated by an even interval in the list, and perfect anticorrelation between elements separated by an odd interval.

`autocorr(dataset)` returns a list of arrays of autocorrelation coefficients when `dataset` is a list of random arrays. Again `autocorr(dataset)[i]` gives the autocorrelations for `dataset` elements separated by distance `i` in the list. Similarly `autocorr(dataset)` returns a dictionary when `dataset` is a dictionary.

`autocorr(dataset)` uses FFTs to compute the autocorrelations; the cost of computing the autocorrelations should grow roughly linearly with the number of random samples in `dataset` (up to logarithms).
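
The definition can be illustrated with a naive O(N²) stand-in (the library itself uses FFTs, and normalization conventions can differ; this sketch assumes the variance is computed once from the full sample):

```python
def autocorr(samples):
    # normalized autocovariance at each separation i (naive, O(N^2))
    n = len(samples)
    mean = sum(samples) / n
    d = [x - mean for x in samples]
    var = sum(x * x for x in d) / n
    return [sum(d[j] * d[j + i] for j in range(n - i)) / ((n - i) * var)
            for i in range(n)]

print(autocorr([2, -2, 2, -2, 2, -2]))   # [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]
```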

gvar.dataset.bin_data(dataset, binsize=2)

Bin random data.

`dataset` is a list of random numbers or random arrays, or a dictionary of lists of random numbers/arrays. `bin_data(dataset, binsize)` replaces consecutive groups of `binsize` numbers/arrays by the average of those numbers/arrays. The result is a new data list (or dictionary) with `1/binsize` times as much random data: for example,

```
>>> print(bin_data([1,2,3,4,5,6,7],binsize=2))
[1.5, 3.5, 5.5]
>>> print(bin_data(dict(s=[1,2,3,4,5],v=[[1,2],[3,4],[5,6],[7,8]]),binsize=2))
{'s': [1.5, 3.5], 'v': [array([ 2.,  3.]), array([ 6.,  7.])]}
```

Data is dropped at the end if there is insufficient data to form complete bins. Binning is used to make calculations faster and to reduce measurement-to-measurement correlations, if they exist. Over-binning erases useful information.

gvar.dataset.bootstrap_iter(dataset, n=None)

Create iterator that returns bootstrap copies of `dataset`.

`dataset` is a list of random numbers or random arrays, or a dictionary of lists of random numbers/arrays. `bootstrap_iter(dataset,n)` is an iterator that returns `n` bootstrap copies of `dataset`. The random numbers/arrays in a bootstrap copy are drawn at random (with repetition allowed) from among the samples in `dataset`: for example,

```
>>> dataset = [1.1, 2.3, 0.5, 1.9]
>>> data_iter = bootstrap_iter(dataset)
>>> print(next(data_iter))
[ 1.1  1.1  0.5  1.9]
>>> print(next(data_iter))
[ 0.5  2.3  1.9  0.5]

>>> dataset = dict(a=[1,2,3,4], b=[1,2,3,4])
>>> data_iter = bootstrap_iter(dataset)
>>> print(next(data_iter))
{'a': array([3, 3, 1, 2]), 'b': array([3, 3, 1, 2])}
>>> print(next(data_iter))
{'a': array([1, 3, 3, 2]), 'b': array([1, 3, 3, 2])}

>>> dataset = [[1,2], [3,4], [5,6], [7,8]]
>>> data_iter = bootstrap_iter(dataset)
>>> print(next(data_iter))
[[ 7.  8.]
 [ 1.  2.]
 [ 1.  2.]
 [ 7.  8.]]
>>> print(next(data_iter))
[[ 3.  4.]
 [ 7.  8.]
 [ 3.  4.]
 [ 1.  2.]]
```

The distribution of bootstrap copies is an approximation to the distribution from which `dataset` was drawn. Consequently means, variances and correlations for bootstrap copies should be similar to those in `dataset`. Analyzing variations from bootstrap copy to copy is often useful when dealing with non-gaussian behavior or complicated correlations between different quantities.

Parameter `n` specifies the maximum number of copies; there is no maximum if `n is None`.
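
A minimal pure-Python sketch of the idea for a list of samples: the key point is drawing indices with replacement. For a dictionary, the same random indices would be reused for every key, which preserves correlations. This is a sketch, not the library's implementation:

```python
import random

def bootstrap_iter(samples, n=None, rng=None):
    # yields bootstrap copies: same length, entries drawn with replacement;
    # the iterator is endless when n is None
    rng = rng or random.Random()
    count = 0
    while n is None or count < n:
        yield [rng.choice(samples) for _ in samples]
        count += 1

dataset = [1.1, 2.3, 0.5, 1.9]
copies = list(bootstrap_iter(dataset, n=2))
# each copy has the same length and only contains values from dataset
```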

## Classes¶

`gvar.dataset.Dataset` is used to assemble random samples from multidimensional distributions:

class gvar.dataset.Dataset(inputdata=None, binsize=1, grep=None, keys=None, h5group='/', nbin=None)

Dictionary for collecting random data.

A `gvar.dataset.Dataset` is an ordered dictionary whose values represent collections of random samples. Each value is a `numpy` array whose first index labels the random sample. Random samples can be numbers or arrays of numbers. The keys identify the quantity being sampled.

A `Dataset` can be assembled piece by piece, as random data is accumulated, or it can be read from a file. Consider a situation where there are four random values for a scalar `s` and four random values for vector `v`. These can be collected as follows:

```
>>> dset = Dataset()
>>> dset.append(s=1.1, v=[12.2, 20.6])
>>> dset.append(s=0.8, v=[14.1, 19.2])
>>> dset.append(s=0.95, v=[10.3, 19.7])
>>> dset.append(s=0.91, v=[8.2, 21.0])
>>> print(dset['s'])       # 4 random values of s
[ 1.1, 0.8, 0.95, 0.91]
>>> print(dset['v'])       # 4 random vector-values of v
[array([ 12.2,  20.6]), array([ 14.1,  19.2]), array([ 10.3,  19.7]), array([  8.2,  21. ])]
```

The argument to `dset.append()` can also be a dictionary: for example, `dd = dict(s=1.1,v=[12.2,20.6]); dset.append(dd)` is equivalent to the first `append` statement above. One can also append data key-by-key: for example, `dset.append('s',1.1); dset.append('v',[12.2,20.6])` is equivalent to the first `append` in the example above.

Use `extend` in place of `append` to add data in batches: for example,

```
>>> dset = Dataset()
>>> dset.extend(s=[1.1, 0.8], v=[[12.2, 20.6], [14.1, 19.2]])
>>> dset.extend(s=[0.95, 0.91], v=[[10.3, 19.7],[8.2, 21.0]])
>>> print(dset['s'])       # 4 random values of s
[ 1.1, 0.8, 0.95, 0.91]
```

gives the same dataset as the first example above.

The same `Dataset` can also be created from a text file named `'datafile'` with the following contents:

```
# file: datafile
s 1.1
v [12.2, 20.6]
s 0.8
v [14.1, 19.2]
s 0.95
v [10.3, 19.7]
s 0.91
v [8.2, 21.0]
```

Here each line consists of a key followed by a new random sample for that key. Lines that begin with `#` are ignored. The file is read using:

```
>>> dset = Dataset('datafile')
>>> print(dset['s'])
[ 1.1, 0.8, 0.95, 0.91]
```

Data can be binned while reading it in, which might be useful if the data set is huge or if correlations are a concern. To bin the data contained in file `datafile` in bins of bin size 2 we use:

```
>>> dset = Dataset('datafile', binsize=2)
>>> print(dset['s'])
[0.95, 0.93]
```

The keys read from a data file are restricted to those listed in keyword `keys` and to those matched (or partially matched) by regular expression `grep`, if either of these is specified: for example,

```
>>> dset = Dataset('datafile')
>>> print([k for k in dset])
['s', 'v']
>>> dset = Dataset('datafile', keys=['v'])
>>> print([k for k in dset])
['v']
>>> dset = Dataset('datafile', grep='[^v]')
>>> print([k for k in dset])
['s']
>>> dset = Dataset('datafile', keys=['v'], grep='[^v]')
>>> print([k for k in dset])
[]
```

In addition to text files, hdf5 files can also be read (provided module `h5py` is available): for example,

```
>>> dset = Dataset('datafile.h5', h5group='/mcdata')
```

reads the hdf5 datasets in hdf5 group `'/mcdata'`. An hdf5 equivalent to the text file above would contain two hdf5 datasets, one with key `'s'` that is a one-dimensional array with shape (4,), and another with key `'v'` that is a two-dimensional array with shape (4, 2):

```
>>> import h5py
>>> for v in h5py.File('datafile.h5')['/mcdata'].values():
...     print(v)
<HDF5 dataset "s": shape (4,), type "<f8">
<HDF5 dataset "v": shape (4, 2), type "<f8">
```

Finally, `Dataset`s can also be constructed from other dictionaries (including other `Dataset`s), or lists of key-data tuples. For example,

```
>>> dset = Dataset('datafile')
>>> dset_binned = Dataset(dset, binsize=2)
>>> dset_v = Dataset(dset, keys=['v'])
```

reads data from file `'datafile'` into `Dataset` `dset`, and then creates a new `Dataset` with the data binned (`dset_binned`), and another that only contains the data with key `'v'` (`dset_v`).

Parameters:
• inputdata (str or list or dictionary) –

If `inputdata` is a string, it is the name of a file containing datasets. Two formats are supported. If the filename ends in ‘.h5’, the file is in hdf5 format, with datasets that are `numpy` arrays whose first index labels the random sample.

The other file format is a text file where each line consists of a key followed by a number or array of numbers representing a new random sample associated with that key. Lines beginning with `#` are comments. A list of text file names can also be supplied, and text files can be compressed (with names ending in `.gz` or `.bz2`).

If `inputdata` is a dictionary or a list of (key,value) tuples, its keys and values are copied into the dataset. Its values should be arrays whose first index labels the random sample.

• binsize (int) – Bin the random samples in bins of size `binsize`. Default value is `binsize=1` (i.e., no binning).

• grep (str or `None`) – If not `None`, only keys that match or partially match regular expression `grep` are retained in the data set. Keys that don’t match are ignored. Default is `grep=None`.

• keys (list) – List of keys to retain in data set. Keys that are not in the list are ignored. Default is `keys=None` which implies that all keys are kept.

• h5group (str or list) – Address within the hdf5 file identified by `inputdata` that contains the relevant datasets. Every hdf5 dataset in group `h5group` is read into the dataset, with the same key as in `h5group`. Default is the top group in the file: `h5group='/'`. `h5group` can also be a list of groups, in which case datasets from all of the groups are read.

The main attributes and methods are:

samplesize

Smallest number of samples for any key.

append(*args, **kargs)

Append data to dataset.

There are three equivalent ways of adding data to a dataset `data`: for example, each of

```
data.append(n=1.739, a=[0.494, 2.734])       # method 1

data.append('n', 1.739)                      # method 2
data.append('a', [0.494, 2.734])

dd = dict(n=1.739, a=[0.494, 2.734])         # method 3
data.append(dd)
```

adds one new random number to `data['n']`, and a new vector to `data['a']`.

extend(*args, **kargs)

There are three equivalent ways of adding batched data, containing multiple samples for each quantity, to a dataset `data`: for example, each of

```
data.extend(n=[1.739, 2.682],
            a=[[0.494, 2.734], [0.172, 1.400]])   # method 1

data.extend('n', [1.739, 2.682])                  # method 2
data.extend('a', [[0.494, 2.734], [0.172, 1.400]])

dd = dict(n=[1.739, 2.682],
          a=[[0.494, 2.734], [0.172, 1.400]])     # method 3
data.extend(dd)
```

adds two new random numbers to `data['n']`, and two new random vectors to `data['a']`.

This method can be used to merge two datasets, whether or not they share keys: for example,

```
data = Dataset("file1")
data_extra = Dataset("file2")
data.extend(data_extra)   # data now contains all of data_extra
```

grep(rexp)

Create new dataset containing items whose keys match `rexp`.

Returns a new `gvar.dataset.Dataset` containing only the items `self[k]` whose keys `k` match regular expression `rexp` (a string) according to Python module `re`:

```
>>> a = Dataset()
>>> a.append(xx=1., xy=[10., 100.])
>>> a.append(xx=2., xy=[20., 200.])
>>> print(a.grep('y'))
{'xy': [array([  10.,  100.]), array([  20.,  200.])]}
>>> print(a.grep('x'))
{'xx': [1.0, 2.0], 'xy': [array([  10.,  100.]), array([  20.,  200.])]}
>>> print(a.grep('x|y'))
{'xx': [1.0, 2.0], 'xy': [array([  10.,  100.]), array([  20.,  200.])]}
>>> print(a.grep('[^y][^x]'))
{'xy': [array([  10.,  100.]), array([  20.,  200.])]}
```

Items are retained even if `rexp` matches only part of the item’s key.

slice(sl)

Create new dataset with `self[k] -> self[k][sl]`.

Parameter `sl` is a slice object that is applied to every item in the dataset to produce a new `gvar.Dataset`. Setting `sl = slice(0,None,2)`, for example, discards every other sample for each quantity in the dataset. Setting `sl = slice(100,None)` discards the first 100 samples for each quantity.

If parameter `sl` is a tuple of slice objects, these are applied to successive indices of `self[k]`. An exception is raised if the number of slice objects exceeds the number of dimensions of any `self[k]`.

arrayzip(template)

Merge lists of random data according to `template`.

`template` is an array of keys in the dataset, where the shapes of `self[k]` are the same for all keys `k` in `template`. `self.arrayzip(template)` merges the lists of random numbers/arrays associated with these keys to create a new list of (merged) random arrays whose layout is specified by `template`: for example,

```
>>> d = Dataset()
>>> d.append(a=1, b=10)
>>> d.append(a=2, b=20)
>>> d.append(a=3, b=30)
>>> print(d)            # three random samples each for a and b
{'a': [1.0, 2.0, 3.0], 'b': [10.0, 20.0, 30.0]}
>>> # merge into list of 2-vectors:
>>> print(d.arrayzip(['a','b']))
[[  1.  10.]
 [  2.  20.]
 [  3.  30.]]
>>> # merge into list of (symmetric) 2x2 matrices:
>>> print(d.arrayzip([['b','a'],['a','b']]))
[[[ 10.   1.]
  [  1.  10.]]

 [[ 20.   2.]
  [  2.  20.]]

 [[ 30.   3.]
  [  3.  30.]]]
```

The number of samples in each merged result is the same as the number of samples for each key (here 3). The keys used in this example represent scalar quantities; in general, they could be either scalars or arrays (of any shape, so long as all have the same shape).
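
The merging can be mimicked in plain Python; this is a hypothetical stand-in (the real method returns `numpy` arrays, not nested lists):

```python
def arrayzip(data, template):
    # template is a (possibly nested) list of keys; build one merged
    # array per sample by substituting data[key][i] for each key
    nsamples = min(len(v) for v in data.values())
    def build(t, i):
        if isinstance(t, list):
            return [build(x, i) for x in t]
        return data[t][i]
    return [build(template, i) for i in range(nsamples)]

d = {'a': [1.0, 2.0, 3.0], 'b': [10.0, 20.0, 30.0]}
print(arrayzip(d, ['a', 'b']))
# [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
print(arrayzip(d, [['b', 'a'], ['a', 'b']])[0])
# [[10.0, 1.0], [1.0, 10.0]]
```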

trim()

Create new dataset where all entries have same sample size.
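
This is equivalent to keeping only the first `samplesize` samples of every entry; a hypothetical plain-dict sketch of the behavior:

```python
# trim() keeps the first samplesize samples of every entry, where
# samplesize is the smallest sample count over all keys
data = {'s': [1.1, 0.8, 0.95], 'v': [[12.2, 20.6], [14.1, 19.2]]}
samplesize = min(len(v) for v in data.values())     # here 2
trimmed = {k: v[:samplesize] for k, v in data.items()}
print(trimmed['s'])   # [1.1, 0.8]
```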

toarray()

Create new dictionary `d` where `d[k]=numpy.array(self[k])` for all `k`.

class gvar.dataset.svd_diagnosis(dataset, nbstrap=50, mincut=1e-12, models=None, process_dataset=None)

Diagnose the need for an SVD cut.

`gvar.dataset.svd_diagnosis` bootstraps the spectrum of the correlation matrix for the data in `dataset` to determine how much of that spectrum is reliably determined by this data.

Here `dataset` is a list of random arrays or a dictionary (e.g., `gvar.dataset.Dataset`) whose values are lists of random numbers or random arrays. The random numbers or arrays are averaged (using `gvar.dataset.avg_data()`) to produce a set of `gvar.GVar`s and their correlation matrix. The smallest eigenvalues of the correlation matrix are poorly estimated when the number of random samples is insufficiently large — the number of samples should typically be significantly larger than the number of random variables being analyzed in order to get good estimates of the correlations between these variables.

Typical usage is

```
import gvar as gv

s = gv.dataset.svd_diagnosis(dataset)
avgdata = gv.svd(s.avgdata, svdcut=s.svdcut)
s.plot_ratio(show=True)
```

where the defective part of the correlation matrix is corrected by applying an SVD cut to the averaged data. A plot showing the ratio of bootstrapped eigenvalues to the actual eigenvalues is displayed by the `s.plot_ratio` command.

Parameters:
• dataset – List of random arrays or a dictionary (e.g., `gvar.dataset.Dataset`) whose values are lists of random numbers or random arrays. Alternatively it can be a tuple `(g, Ns)` where `g` is an array of `gvar.GVar`s or a dictionary whose values are `gvar.GVar`s or arrays of `gvar.GVar`s, and `Ns` is the number of random samples. The list of random data that is analyzed is then created using `gvar.raniter(g, n=Ns)`.

• nbstrap – Number of bootstrap copies used (default is 50).

• models – For use in conjunction with `lsqfit.MultiFitter`; ignored when not specified. When specified, it is a list of multi-fitter models used to specify which parts of the data are being analyzed. The correlation matrix is restricted to the data specified by the models, and the data returned are “processed data” for use with a multi-fitter (via keyword `pdata` rather than `data`). Ignored if keyword `process_dataset` is specified.

• process_dataset – Function that converts datasets into averaged data. Function `gvar.dataset.avg_data()` is used if set equal to `None` (default).

• mincut – Minimum SVD cut (default 1e-12).

The main attributes are:

svdcut

SVD cut for bad eigenvalues in correlation matrix.

eps

`eps` corresponding to `svdcut`, for use in `gvar.regulate()`.

avgdata

Averaged data (`gvar.dataset.avg_data(dataset)`).

val

Eigenvalues of the correlation matrix.

bsval

Bootstrap average of correlation matrix eigenvalues.

nmod

Number of eigenmodes modified by SVD cut `svdcut`.

A method is available to display the eigenvalues:

plot_ratio(plot=None, show=False)

Plot the ratio of bootstrapped eigenvalues to the actual eigenvalues.

Ratios (blue points) are plotted versus the value of the actual eigenvalues divided by the maximum eigenvalue. Error bars on the ratios show the range of variation across bootstrapped copies. A dotted line is drawn at `1 - sqrt(2/N)`, where `N` is the number of data points. The proposed SVD cut is where the ratio curve intersects this line; that point is indicated by a vertical dashed red line. The plot object is returned.

Parameters:
• plot – `matplotlib` plotter used to make the plot. Uses `plot = matplotlib.pyplot` if `plot=None` (default).

• show – Displays the plot if `show=True` (default `False`).