gvar.dataset - Random Data Sets

Introduction

gvar.dataset contains several tools for collecting and analyzing random samples from arbitrary distributions. The random samples are represented by lists of numbers or arrays, where each number/array is a new sample from the underlying distribution. For example, six samples from a one-dimensional gaussian distribution, 1±1, might look like

>>> random_numbers = [1.739, 2.682, 2.493, -0.460, 0.603, 0.800]

while six samples from a two-dimensional distribution, [1±1, 2±1], might be

>>> random_arrays = [[ 0.494, 2.734], [ 0.172, 1.400], [ 1.571, 1.304],
...                  [ 1.532, 1.510], [ 0.669, 0.873], [ 1.242, 2.188]]

Samples from more complicated multidimensional distributions are represented by dictionaries whose values are lists of numbers or arrays: for example,

>>> random_dict = dict(n=random_numbers, a=random_arrays)

where list elements random_dict['n'][i] and random_dict['a'][i] are part of the same multidimensional sample for every i — that is, the lists for different keys in the dictionary are synchronized with each other.

With large samples, we typically want to estimate the mean value of the underlying distribution. This is done using gvar.dataset.avg_data(): for example,

>>> print(avg_data(random_numbers))
1.31(45)

indicates that 1.31(45) is our best guess, based only upon the samples in random_numbers, for the mean of the distribution from which those samples were drawn. Similarly

>>> print(avg_data(random_arrays))
[0.95(22) 1.67(25)]

indicates that the means for the two-dimensional distribution behind random_arrays are [0.95(22), 1.67(25)]. avg_data() can also be applied to a dictionary whose values are lists of numbers/arrays: for example,

>>> print(avg_data(random_dict))
{'a': array([0.95(22), 1.67(25)], dtype=object), 'n': 1.31(45)}

Class gvar.dataset.Dataset can be used to assemble dictionaries containing random samples. For example, imagine that the random samples above were originally written into a file, as they were generated:

# file: datafile
n 1.739
a [ 0.494, 2.734]
n 2.682
a [ 0.172, 1.400]
n 2.493
a [ 1.571, 1.304]
n -0.460
a [ 1.532, 1.510]
n 0.603
a [ 0.669, 0.873]
n 0.800
a [ 1.242, 2.188]

Here each line is a different random sample, either from the one-dimensional distribution (labeled n) or from the two-dimensional distribution (labeled a). Assuming the file is called datafile, this data can be read into a dictionary, essentially identical to the data dictionary above, using:

>>> data = Dataset("datafile")
>>> print(data['a'])
[array([ 0.494, 2.734]), array([ 0.172, 1.400]), array([ 1.571, 1.304]) ... ]
>>> print(avg_data(data['n']))
1.31(45)

The brackets and commas can be omitted in the input file for one-dimensional arrays: for example, datafile (above) could equivalently be written

# file: datafile
n 1.739
a 0.494 2.734
n 2.682
a 0.172 1.400
...

Other data formats may also be easy to use. For example, a data file written using yaml would look like

# file: datafile
---
n: 1.739
a: [ 0.494, 2.734]
---
n: 2.682
a: [ 0.172, 1.400]
.
.
.

and could be read into a gvar.dataset.Dataset using:

import yaml

data = Dataset()
with open("datafile", "r") as dfile:
    for d in yaml.safe_load_all(dfile):     # iterate over yaml records
        data.append(d)                      # d is a dictionary

Finally note that data can be binned, into bins of size binsize, using gvar.dataset.bin_data(). For example, gvar.dataset.bin_data(data, binsize=3) replaces every three samples in data by the average of those samples. This creates a dataset that is 1/3 the size of the original but has the same mean. Binning is useful for making large datasets more manageable, and also for removing sample-to-sample correlations. Over-binning, however, erases statistical information.
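
For example, here is a minimal sketch that bins the random_numbers samples from above (the values in the comments are approximate):

binned = bin_data(random_numbers, binsize=3)
# binned is approximately [2.305, 0.314]: averages of samples 0-2 and of samples 3-5
print(avg_data(binned))
# same mean as avg_data(random_numbers), roughly 1.31, but the error
# estimate is now based on only two (binned) samples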

Class gvar.dataset.Dataset can also be used to build a dataset sample by sample in code: for example,

>>> a = Dataset()
>>> a.append(n=1.739, a=[ 0.494, 2.734])
>>> a.append(n=2.682, a=[ 0.172, 1.400])
...

creates the same dataset as above.

Functions

The functions defined in the module are:

gvar.dataset.avg_data(dataset, spread=False, median=False, bstrap=False, noerror=False, mismatch='truncate', unbias=False, warn=False)

Average dataset to estimate means and covariance.

dataset is a list of random numbers, a list of random arrays, or a dictionary of lists of random numbers and/or arrays: for example,

>>> random_numbers = [1.60, 0.99, 1.28, 1.30, 0.54, 2.15]
>>> random_arrays = [[12.2,121.3],[13.4,149.2],[11.7,135.3],
...                  [7.2,64.6],[15.2,69.0],[8.3,108.3]]
>>> random_dict = dict(n=random_numbers,a=random_arrays)

where in each case there are six random numbers/arrays. avg_data estimates the means of the distributions from which the random numbers/arrays are drawn, together with the uncertainties in those estimates. The results are returned as a gvar.GVar or an array of gvar.GVars, or a dictionary of gvar.GVars and/or arrays of gvar.GVars:

>>> print(avg_data(random_numbers))
1.31(20)
>>> print(avg_data(random_arrays))
[11.3(1.1) 108(13)]
>>> print(avg_data(random_dict))
{'a': array([11.3(1.1), 108(13)], dtype=object), 'n': 1.31(20)}

The arrays in random_arrays are one dimensional; in general, they can have any shape.

avg_data(dataset) also estimates any correlations between different quantities in dataset. When dataset is a dictionary, it does this by assuming that the lists of random numbers/arrays for the different dataset[k]s are synchronized, with the first element in one list corresponding to the first elements in all other lists, and so on.

Note that estimates of the correlations are robust only if the number of samples being averaged is substantially larger (eg, 10x) than the number of quantities being averaged. The correlation matrix is poorly conditioned or singular if the number of samples is too small. Function gvar.dataset.svd_diagnosis() can be used to determine whether there is a problem, and, if so, the problem can be ameliorated by applying an SVD cut to the data after averaging (gvar.regulate()).

Parameters:
  • stderr (bool) – If True (default) the variances and covariances represent the uncertainties in the estimates of the means (corresponding to the standard errors). If False they represent the spread in the data (corresponding to the standard deviations of the distribution). The former is smaller by a factor of one over the number of samples.

  • spread (bool) – (Legacy) spread=True is the same as stderr=False, and vice versa; see the sketch following this parameter list.

  • median (bool) – Setting median=True replaces the means in the results by medians, while the standard deviations are approximated by the width of the larger of the intervals above or below the median that contains 34% of the data. These estimates are more robust than the mean and standard deviation when averaging over small amounts of data; in particular, they are unaffected by extreme outliers in the data. But correlations between different variables are ignored in this case. The default is median=False.

  • bstrap (bool) – Setting bstrap=True is shorthand for setting median=True and stderr=False, and overrides any explicit setting for these keyword arguments. This is the typical choice for analyzing bootstrap data — hence its name. The default value is bstrap=False.

  • noerror (bool) – Error estimates on the averages (and correlations) are omitted if noerror=True is set — just the mean values are returned. The default value is noerror=False.

  • mismatch (str) –

    The mismatch option is only relevant when dataset is a dictionary whose entries dataset[k] have different sample sizes. This complicates the calculation of correlations between the different entries. There are three choices for mismatch:

    mismatch='truncate':

    Samples are discarded from the ends of entries so that all entries have the same sample size (equal to the smallest sample size). This is the default.

    mismatch='wavg':

    The data set is decomposed into a collection of data sets so that the entries in any given data set all have the same sample size; no samples are discarded. Each of the resulting data sets is averaged separately, and the final result is the weighted average of these averages. This choice is the most accurate but also the slowest, especially for large problems. It should only be used when the smallest sample size is much larger (eg, 10x) than the number of quantities being averaged. It also requires the lsqfit Python module.

    mismatch='decorrelate':

    Ignores correlations between different entries dataset[k]. All samples are used and correlations within an entry dataset[k] are retained. This is the fastest choice.

  • unbias (bool) – The default choice unbias=False means that variances are estimated using sum((x[i] - mean(x))**2)/N where N is the number of random samples. This estimate is biased but is typically more accurate than the unbiased estimate obtained when N is replaced by N-1 (because statistical fluctuations are then larger than the bias). The difference is negligible unless N is very small. Setting unbias=True gives the unbiased estimate.

  • warn (bool) – warn determines whether or not a warning is issued when different entries in a dictionary data set have different sample sizes. The default is warn=False.
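
The sketch below illustrates several of these options, using the random_numbers samples from above; the values quoted in the comments are approximate:

print(avg_data(random_numbers))                  # standard errors: 1.31(20)
print(avg_data(random_numbers, spread=True))     # spread of the data: roughly 1.31(50)
print(avg_data(random_numbers, median=True))     # median in place of the mean
print(avg_data(random_numbers, noerror=True))    # mean only, no error: 1.31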

gvar.dataset.autocorr(dataset)

Compute autocorrelation in dataset.

dataset is a list of random numbers or random arrays, or a dictionary of lists of random numbers/arrays.

When dataset is a list of random numbers, autocorr(dataset) returns an array where autocorr(dataset)[i] is the correlation between elements in dataset that are separated by distance i in the list: for example,

>>> print(autocorr([2,-2,2,-2,2,-2]))
[ 1. -1.  1. -1.  1. -1.]

shows perfect correlation between elements separated by an even interval in the list, and perfect anticorrelation between elements separated by an odd interval.

autocorr(dataset) returns a list of arrays of autocorrelation coefficients when dataset is a list of random arrays. Again autocorr(dataset)[i] gives the autocorrelations for dataset elements separated by distance i in the list. Similarly autocorr(dataset) returns a dictionary when dataset is a dictionary.

autocorr(dataset) uses FFTs to compute the autocorrelations; the cost of computing the autocorrelations should grow roughly linearly with the number of random samples in dataset (up to logarithms).
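
The autocorrelations can be used, for example, to choose a sensible bin size for gvar.dataset.bin_data(). The following is a minimal sketch; the correlated samples are generated here purely for illustration:

import numpy as np
import gvar as gv

# artificially correlated (Markov-chain-like) samples, for illustration only
rng = np.random.default_rng(12345)
x = np.zeros(1000)
for i in range(1, len(x)):
    x[i] = 0.8 * x[i - 1] + rng.normal()

rho = gv.dataset.autocorr(list(x))
# for this process rho[i] falls roughly like 0.8 ** i, so binsize=10 or so
# makes samples in different bins nearly independent
binned = gv.dataset.bin_data(list(x), binsize=10)
print(gv.dataset.avg_data(binned))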

gvar.dataset.bin_data(dataset, binsize=2)

Bin random data.

dataset is a list of random numbers or random arrays, or a dictionary of lists of random numbers/arrays. bin_data(dataset, binsize) replaces consecutive groups of binsize numbers/arrays by the average of those numbers/arrays. The result is a new data list (or dictionary) with 1/binsize times as much random data: for example,

>>> print(bin_data([1,2,3,4,5,6,7],binsize=2))
[1.5, 3.5, 5.5]
>>> print(bin_data(dict(s=[1,2,3,4,5],v=[[1,2],[3,4],[5,6],[7,8]]),binsize=2))
{'s': [1.5, 3.5], 'v': [array([ 2.,  3.]), array([ 6.,  7.])]}

Data is dropped at the end if there is insufficient data to form complete bins. Binning is used to make calculations faster and to reduce measurement-to-measurement correlations, if they exist. Over-binning erases useful information.

gvar.dataset.bootstrap_iter(dataset, n=None)

Create iterator that returns bootstrap copies of dataset.

dataset is a list of random numbers or random arrays, or a dictionary of lists of random numbers/arrays. bootstrap_iter(dataset,n) is an iterator that returns n bootstrap copies of dataset. The random numbers/arrays in a bootstrap copy are drawn at random (with repetition allowed) from among the samples in dataset: for example,

>>> dataset = [1.1, 2.3, 0.5, 1.9]
>>> data_iter = bootstrap_iter(dataset)
>>> print(next(data_iter))
[ 1.1  1.1  0.5  1.9]
>>> print(next(data_iter))
[ 0.5  2.3  1.9  0.5]

>>> dataset = dict(a=[1,2,3,4],b=[1,2,3,4])
>>> data_iter = bootstrap_iter(dataset)
>>> print(next(data_iter))
{'a': array([3, 3, 1, 2]), 'b': array([3, 3, 1, 2])}
>>> print(next(data_iter))
{'a': array([1, 3, 3, 2]), 'b': array([1, 3, 3, 2])}

>>> dataset = [[1,2],[3,4],[5,6],[7,8]]
>>> data_iter = bootstrap_iter(dataset)
>>> print(next(data_iter))
[[ 7.  8.]
 [ 1.  2.]
 [ 1.  2.]
 [ 7.  8.]]
>>> print(next(data_iter))
[[ 3.  4.]
 [ 7.  8.]
 [ 3.  4.]
 [ 1.  2.]]

The distribution of bootstrap copies is an approximation to the distribution from which dataset was drawn. Consequently means, variances and correlations for bootstrap copies should be similar to those in dataset. Analyzing variations from bootstrap copy to copy is often useful when dealing with non-gaussian behavior or complicated correlations between different quantities.

Parameter n specifies the maximum number of copies; there is no maximum if n is None.
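
For example, here is a minimal sketch that uses the copy-to-copy variation to estimate the uncertainty in the median of a (purely illustrative) dataset:

import numpy as np
import gvar as gv

dataset = [1.1, 2.3, 0.5, 1.9]
medians = []
for bscopy in gv.dataset.bootstrap_iter(dataset, n=100):
    medians.append(np.median(bscopy))
# the spread of the medians across bootstrap copies estimates the
# uncertainty in the median of the original dataset
print(np.mean(medians), np.std(medians))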

Classes

gvar.dataset.Dataset is used to assemble random samples from multidimensional distributions:

class gvar.dataset.Dataset(inputdata=None, binsize=1, grep=None, keys=None, h5group='/', nbin=None)

Dictionary for collecting random data.

A gvar.dataset.Dataset is an ordered dictionary whose values represent collections of random samples. Each value is a numpy array whose first index labels the random sample. Random samples can be numbers or arrays of numbers. The keys identify the quantity being sampled.

A Dataset can be assembled piece by piece, as random data is accumulated, or it can be read from a file. Consider a situation where there are four random values for a scalar s and four random values for vector v. These can be collected as follows:

>>> dset = Dataset()
>>> dset.append(s=1.1, v=[12.2, 20.6])
>>> dset.append(s=0.8, v=[14.1, 19.2])
>>> dset.append(s=0.95, v=[10.3, 19.7])
>>> dset.append(s=0.91, v=[8.2, 21.0])
>>> print(dset['s'])       # 4 random values of s
[ 1.1, 0.8, 0.95, 0.91]
>>> print(dset['v'])       # 4 random vector-values of v
[array([ 12.2,  20.6]), array([ 14.1,  19.2]), array([ 10.3,  19.7]), array([  8.2,  21. ])]

The argument to dset.append() can also be a dictionary: for example, dd = dict(s=1.1,v=[12.2,20.6]); dset.append(dd) is equivalent to the first append statement above. One can also append data key-by-key: for example, dset.append('s',1.1); dset.append('v',[12.2,20.6]) is equivalent to the first append in the example above.

Use extend in place of append to add data in batches: for example,

>>> dset = Dataset()
>>> dset.extend(s=[1.1, 0.8], v=[[12.2, 20.6], [14.1, 19.2]])
>>> dset.extend(s=[0.95, 0.91], v=[[10.3, 19.7],[8.2, 21.0]])
>>> print(dset['s'])       # 4 random values of s
[ 1.1, 0.8, 0.95, 0.91]

gives the same dataset as the first example above.

The same Dataset can also be created from a text file named 'datafile' with the following contents:

# file: datafile
s 1.1
v [12.2, 20.6]
s 0.8
v [14.1, 19.2]
s 0.95
v [10.3, 19.7]
s 0.91
v [8.2, 21.0]

Here each line consists of a key followed by a new random sample for that key. Lines that begin with # are ignored. The file is read using:

>>> dset = Dataset('datafile')
>>> print(dset['s'])
[ 1.1, 0.8, 0.95, 0.91]

Data can be binned while reading it in, which might be useful if the data set is huge or if correlations are a concern. To bin the data contained in file datafile in bins of bin size 2 we use:

>>> dset = Dataset('datafile', binsize=2)
>>> print(dset['s'])
[0.95, 0.93]

The keys read from a data file are restricted to those listed in keyword keys and those that are matched (or partially matched) by regular expression grep if one or other of these is specified: for example,

>>> dset = Dataset('datafile')
>>> print([k for k in dset])
['s', 'v']
>>> dset = Dataset('datafile', keys=['v'])
>>> print([k for k in dset])
['v']
>>> dset = Dataset('datafile', grep='[^v]')
>>> print([k for k in dset])
['s']
>>> dset = Dataset('datafile', keys=['v'], grep='[^v]')
>>> print([k for k in dset])
[]

In addition to text files, hdf5 files can also be read (provided module h5py is available): for example,

>>> dset = Dataset('datafile.h5', h5group='/mcdata')

reads the hdf5 datasets in hdf5 group '/mcdata'. An hdf5 equivalent to the text file above would contain two hdf5 datasets in that group, one with key 's' that is a one-dimensional array with shape (4,), and another with key 'v' that is a two-dimensional array with shape (4, 2):

>>> import h5py
>>> for v in h5py.File('datafile.h5', 'r')['/mcdata'].values():
...     print(v)
<HDF5 dataset "s": shape (4,), type "<f8">
<HDF5 dataset "v": shape (4, 2), type "<f8">

Finally, Datasets can also be constructed from other dictionaries (including other Datasets), or lists of key-data tuples. For example,

>>> dset = Dataset('datafile')
>>> dset_binned = Dataset(dset, binsize=2)
>>> dset_v = Dataset(dset, keys=['v'])

reads data from file 'datafile' into Dataset dset, and then creates a new Dataset with the data binned (dset_binned), and another that only contains the data with key 'v' (dset_v).

Parameters:
  • inputdata (str or list or dictionary) –

    If inputdata is a string, it is the name of a file containing datasets. Two formats are supported. If the filename ends in ‘.h5’, the file is in hdf5 format, with datasets that are numpy arrays whose first index labels the random sample.

    The other file format is a text file where each line consists of a key followed by a number or array of numbers representing a new random sample associated with that key. Lines beginning with # are comments. A list of text file names can also be supplied, and text files can be compressed (with names ending in .gz or .bz2).

    If inputdata is a dictionary or a list of (key,value) tuples, its keys and values are copied into the dataset. Its values should be arrays whose first index labels the random sample.

  • binsize (int) – Bin the random samples in bins of size binsize. Default value is binsize=1 (i.e., no binning).

  • grep (str or None) – If not None, only keys that match or partially match regular expression grep are retained in the data set. Keys that don’t match are ignored. Default is grep=None.

  • keys (list) – List of keys to retain in data set. Keys that are not in the list are ignored. Default is keys=None which implies that all keys are kept.

  • h5group (str or list) – Address within the hdf5 file identified by inputdata that contains the relevant datasets. Every hdf5 dataset in group h5group is read into the dataset, with the same key as in h5group. Default is the top group in the file: h5group='/'. h5group can also be a list of groups, in which case datasets from all of the groups are read.

The main attributes and methods are:

samplesize

Smallest number of samples for any key.

append(*args, **kargs)

Append data to dataset.

There are three equivalent ways of adding data to a dataset data: for example, each of

data.append(n=1.739,a=[0.494,2.734])        # method 1

data.append('n', 1.739)                     # method 2
data.append('a', [0.494,2.734])

dd = dict(n=1.739,a=[0.494,2.734])          # method 3
data.append(dd)

adds one new random number to data['n'], and a new vector to data['a'].

extend(*args, **kargs)

Add batched data to dataset.

There are three equivalent ways of adding batched data, containing multiple samples for each quantity, to a dataset data: for example, each of

data.extend(n=[1.739,2.682],
            a=[[0.494,2.734],[ 0.172, 1.400]])  # method 1

data.extend('n', [1.739,2.682])                 # method 2
data.extend('a', [[0.494,2.734],[ 0.172, 1.400]])

dd = dict(n=[1.739,2.682],
            a=[[0.494,2.734],[ 0.172, 1.400]])  # method 3
data.extend(dd)

adds two new random numbers to data['n'], and two new random vectors to data['a'].

This method can be used to merge two datasets, whether or not they share keys: for example,

data = Dataset("file1")
data_extra = Dataset("file2")
data.extend(data_extra)   # data now contains all of data_extra

grep(rexp)

Create new dataset containing items whose keys match rexp.

Returns a new gvar.dataset.Dataset containing only the items self[k] whose keys k match regular expression rexp (a string) according to Python module re:

>>> a = Dataset()
>>> a.append(xx=1.,xy=[10.,100.])
>>> a.append(xx=2.,xy=[20.,200.])
>>> print(a.grep('y'))
{'xy': [array([  10.,  100.]), array([  20.,  200.])]}
>>> print(a.grep('x'))
{'xx': [1.0, 2.0], 'xy': [array([  10.,  100.]), array([  20.,  200.])]}
>>> print(a.grep('x|y'))
{'xx': [1.0, 2.0], 'xy': [array([  10.,  100.]), array([  20.,  200.])]}
>>> print(a.grep('[^y][^x]'))
{'xy': [array([  10.,  100.]), array([  20.,  200.])]}

Items are retained even if rexp matches only part of the item’s key.

slice(sl)

Create new dataset with self[k] -> self[k][sl].

Parameter sl is a slice object that is applied to every item in the dataset to produce a new gvar.dataset.Dataset. Setting sl = slice(0,None,2), for example, discards every other sample for each quantity in the dataset. Setting sl = slice(100,None) discards the first 100 samples for each quantity.

If parameter sl is a tuple of slice objects, these are applied to successive indices of self[k]. An exception is raised if the number of slice objects exceeds the number of dimensions for any self[k].
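
For example, a minimal sketch using the datafile dataset from above:

dset = Dataset('datafile')
thinned = dset.slice(slice(0, None, 2))   # keep every other sample
# thinned['s'] now contains samples 0 and 2 of dset['s']: 1.1 and 0.95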

arrayzip(template)

Merge lists of random data according to template.

template is an array of keys in the dataset, where the shapes of self[k] are the same for all keys k in template. self.arrayzip(template) merges the lists of random numbers/arrays associated with these keys to create a new list of (merged) random arrays whose layout is specified by template: for example,

>>> d = Dataset()
>>> d.append(a=1,b=10)
>>> d.append(a=2,b=20)
>>> d.append(a=3,b=30)
>>> print(d)            # three random samples each for a and b
{'a': [1.0, 2.0, 3.0], 'b': [10.0, 20.0, 30.0]}
>>> # merge into list of 2-vectors:
>>> print(d.arrayzip(['a','b']))
[[  1.  10.]
 [  2.  20.]
 [  3.  30.]]
>>> # merge into list of (symmetric) 2x2 matrices:
>>> print(d.arrayzip([['b','a'],['a','b']]))
[[[ 10.   1.]
  [  1.  10.]]

 [[ 20.   2.]
  [  2.  20.]]

 [[ 30.   3.]
  [  3.  30.]]]

The number of samples in each merged result is the same as the number of samples for each key (here 3). The keys used in this example represent scalar quantities; in general, they could be either scalars or arrays (of any shape, so long as all have the same shape).

trim()

Create new dataset where all entries have same sample size.

toarray()

Create new dictionary d where d[k]=numpy.array(self[k]) for all k.
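
Here is a minimal sketch that illustrates samplesize, trim() and toarray() together, using a small dataset (built here just for illustration) whose entries have unequal sample sizes:

dset = Dataset()
dset.extend(s=[1.1, 0.8], v=[[12.2, 20.6], [14.1, 19.2]])
dset.append('s', 0.95)           # one extra sample for s only
print(dset.samplesize)           # smallest number of samples for any key: 2
trimmed = dset.trim()            # all entries now have 2 samples
arrays = trimmed.toarray()       # dictionary of numpy arrays
print(arrays['v'].shape)         # (2, 2): 2 samples of a 2-vector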

class gvar.dataset.svd_diagnosis(dataset, nbstrap=50, mincut=1e-12, models=None, process_dataset=None)

Diagnose the need for an SVD cut.

gvar.dataset.svd_diagnosis bootstraps the spectrum of the correlation matrix for the data in dataset to determine how much of that spectrum is reliably determined by this data.

Here dataset is a list of random arrays or a dictionary (e.g., gvar.dataset.Dataset) whose values are lists of random numbers or random arrays. The random numbers or arrays are averaged (using gvar.dataset.avg_data()) to produce a set of gvar.GVars and their correlation matrix. The smallest eigenvalues of the correlation matrix are poorly estimated when the number of random samples is insufficiently large — the number of samples should typically be significantly larger than the number of random variables being analyzed in order to get good estimates of the correlations between these variables.

Typical usage is

import gvar as gv

s = gv.dataset.svd_diagnosis(dataset)
avgdata = gv.svd(s.avgdata, svdcut=s.svdcut)
s.plot_ratio(show=True)

where the defective part of the correlation matrix is corrected by applying an SVD cut to the averaged data. A plot showing the ratio of bootstrapped eigenvalues to the actual eigenvalues is displayed by the s.plot_ratio command.

Parameters:
  • dataset – List of random arrays or a dictionary (e.g., gvar.dataset.Dataset) whose values are lists of random numbers or random arrays. Alternatively it can be a tuple (g, Ns) where: g is an array of gvar.GVars or a dictionary whose values are gvar.GVars or arrays of gvar.GVars; and Ns is the number of random samples. Then the list of random data that is analyzed is created using gvar.raniter(g, n=Ns).

  • nbstrap – Number of bootstrap copies used (default is 50).

  • models – For use in conjunction with lsqfit.MultiFitter; ignored when not specified. When specified, it is a list of multi-fitter models used to specify which parts of the data are being analyzed. The correlation matrix is restricted to the data specified by the models, and the data returned are “processed data” for use with a multi-fitter using keyword pdata rather than data. Ignored if keyword process_dataset is specified.

  • process_dataset – Function that converts datasets into averaged data. Function gvar.dataset.avg_data() is used if set equal to None (default).

  • mincut – Minimum SVD cut (default 1e-12).

The main attributes are:

svdcut

SVD cut for bad eigenvalues in correlation matrix.

eps

eps corresponding to svdcut, for use in gvar.regulate().

avgdata

Averaged data (gvar.dataset.avg_data(dataset)).

val

Eigenvalues of the correlation matrix.

bsval

Bootstrap average of correlation matrix eigenvalues.

nmod

Number of eigenmodes modified by SVD cut svdcut.
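
For example, a minimal sketch of using eps in place of an SVD cut, continuing the svd_diagnosis example from above:

s = gv.dataset.svd_diagnosis(dataset)
avgdata = gv.regulate(s.avgdata, eps=s.eps)   # regulate the correlation matrix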

A method is available to display the eigenvalues:

plot_ratio(plot=None, show=False)

Plot ratio of bootstrapped eigenvalues divided by actual eigenvalues.

Ratios (blue points) are plotted versus the value of the actual eigenvalues divided by the maximum eigenvalue. Error bars on the ratios show the range of variation across bootstrapped copies. A dotted line is drawn at 1 - sqrt(2/N), where N is the number of data points. The proposed SVD cut is where the ratio curve intersects this line; that point is indicated by a vertical dashed red line. The plot object is returned.

Parameters:
  • plot – matplotlib plotter used to make the plot. Uses plot = matplotlib.pyplot if plot=None (default).

  • show – Displays the plot if show=True (default False).