Data loader module

DataCollection

class mdgru.data.DataCollection(kw)[source]

Bases: object

Abstract class for all data handling classes.

Parameters:kw (dict containing the following options.) –
  • seed [default: 1234] Seed to be used for deterministic random sampling, given no threading is used
  • nclasses [default: None]
_defaults = {'nclasses': None, 'seed': {'help': 'Seed to be used for deterministic random sampling, given no threading is used', 'value': 1234}}
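
The _defaults dictionary mixes two kinds of entries: bare default values and dictionaries that carry the default under 'value' next to a 'help' text. The following is a minimal sketch of how such a structure could be merged with user options. The helper name resolve_defaults is hypothetical and is not part of mdgru; the library's own option handling may differ.

```python
def resolve_defaults(defaults, kw):
    """Merge user options in kw over a _defaults-style dictionary (hypothetical helper)."""
    resolved = {}
    for name, spec in defaults.items():
        # An entry is either a bare default or a dict holding 'value' and 'help'.
        if isinstance(spec, dict):
            default = spec.get("value")
        else:
            default = spec
        # A value supplied by the caller wins over the default.
        resolved[name] = kw.get(name, default)
    return resolved


defaults = {
    "nclasses": None,
    "seed": {"help": "Seed to be used for deterministic random sampling", "value": 1234},
}
options = resolve_defaults(defaults, {"nclasses": 4})
print(options)
```

Entries without a 'value' key, such as featurefiles in GridDataCollection, resolve to None in this sketch unless the caller supplies them.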
_one_hot_vectorize(indexlabels, nclasses=None, zero_out_label=None)[source]

Simplified one-hot label method. Interpolated labels are discouraged, so this only accepts integer values in indexlabels.

Parameters:
  • indexlabels (ndarray) – array of class labels or indices, ranging from 0 to nclasses-1
  • nclasses (int) – number of classes
  • zero_out_label (int) – label to assign probability of zero for the whole probability distribution
Returns:

ndarray – Probability distributions per pixel, where the entry at position indexlabels is set to 1 and all others to 0
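
As an illustration of the behaviour described above, here is a minimal NumPy sketch of one-hot encoding with an optional zeroed-out label. It is not the mdgru implementation. The function name one_hot_sketch is hypothetical, and treating zero_out_label as "pixels with this label get an all-zero distribution" is an assumption drawn from the parameter description.

```python
import numpy as np


def one_hot_sketch(indexlabels, nclasses=None, zero_out_label=None):
    """Hypothetical stand-in for _one_hot_vectorize, for illustration only."""
    indexlabels = np.asarray(indexlabels)
    if not np.issubdtype(indexlabels.dtype, np.integer):
        raise TypeError("indexlabels must contain integer labels")
    if nclasses is None:
        nclasses = int(indexlabels.max()) + 1
    # Indexing the identity matrix appends a class axis of length nclasses.
    onehot = np.eye(nclasses, dtype=np.float32)[indexlabels]
    if zero_out_label is not None:
        # Assumption: pixels carrying this label get an all-zero distribution.
        onehot[indexlabels == zero_out_label] = 0
    return onehot


labels = np.array([[0, 1], [2, 1]])
encoded = one_hot_sketch(labels, nclasses=3, zero_out_label=2)
print(encoded.shape)
```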

static get_all_tps(folder, featurefiles, maskfiles)[source]

Computes a list of all subfolders of folder that contain all provided featurefiles and maskfiles.

Parameters:
  • folder (str) – location at which timepoints are searched
  • featurefiles (list of str) – necessary featurefiles to be contained in a timepoint
  • maskfiles (list of str) – necessary maskfiles to be contained in a timepoint
Returns:

sorted list – valid timepoints in string format
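
The search described above can be sketched with os.walk. This is an illustrative stand-in and not the library's code. The name find_timepoints is hypothetical, and excluding the root folder itself from the result is an assumption based on the word "subfolders".

```python
import os


def find_timepoints(folder, featurefiles, maskfiles):
    """Hypothetical sketch: subfolders of folder that hold every required file."""
    required = set(featurefiles) | set(maskfiles)
    found = []
    for dirpath, _dirnames, filenames in os.walk(folder):
        # Assumption: the root folder itself does not count as a timepoint.
        if dirpath != folder and required.issubset(filenames):
            found.append(dirpath)
    return sorted(found)
```

Calling find_timepoints("/data/study", ["t1.nii.gz"], ["mask.nii.gz"]) would return the sorted paths of every subfolder containing both files. The path and file names here are placeholders.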

get_data_dims()[source]

Returns the dimensionality of the whole collection. Even if samples are returned or computed on the fly, the theoretical size is returned. The result has two or three entries, depending on the type of data: a dataset of sequences of vectors has three, a dataset of sequences of indices has two, and so on.

Returns:list – A shape array of the dimensionality of the data.
get_shape()[source]
get_states()[source]

Get states of this data collection

random_sample(**kw)[source]

Randomly samples from our dataset. If the implementation knows different datasets, the dataset string can be used to choose one; if not, it is ignored.

Parameters:**kw (keyword args) – batch_size can be set, amongst other parameters. See implementing methods for more detail.
Returns:array – A random sample of length batch_size.
reset_seed(seed=12345678)[source]

reset main random number generator with given seed

set_states(state)[source]

Resets the random generators to the states given in “state”.

Parameters:state (object) – Random generator state
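
The interplay of reset_seed, get_states and set_states can be illustrated with NumPy's RandomState, which exposes get_state and set_state. The class below is a hypothetical stand-in and assumes a single generator; the actual collections may track several generators.

```python
import numpy as np


class SeededSampler:
    """Hypothetical stand-in showing save and restore of random generator state."""

    def __init__(self, seed=1234):
        self.rng = np.random.RandomState(seed)

    def reset_seed(self, seed=12345678):
        # Re-seed the main generator.
        self.rng.seed(seed)

    def get_states(self):
        return self.rng.get_state()

    def set_states(self, states):
        self.rng.set_state(states)


sampler = SeededSampler()
saved = sampler.get_states()
first = sampler.rng.randint(0, 100, size=5)
sampler.set_states(saved)  # rewind to the saved state
second = sampler.rng.randint(0, 100, size=5)
print((first == second).all())
```

Restoring a saved state replays the same draws, which is what makes checkpointed training resumable as long as no threading is involved.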

GridDataCollection

class mdgru.data.grid_collection.GridDataCollection(w, p, location=None, tps=None, kw={})[source]

Bases: mdgru.data.DataCollection

Parameters:
  • kw (dict containing the following options.) –
    • featurefiles Filenames of featurefiles.
    • maskfiles [default: []] Filenames of mask file(s) to be used as reference
    • subtractGaussSigma [default: [5]] Standard deviations to use for gaussian filtered image during highpass filtering data augmentation step. No arguments deactivates the feature. Can have 1 or nfeatures entries
    • nooriginal [default: False] Do not use original data, only gauss filtered
    • correct_orientation [default: True] Correct for the nifti orientation. Disable this, for example, if header information cannot be trusted but all data arrays are correctly aligned.
    • deform [default: [0]] Deformation grid spacing in pixels. If zero, no deformation will be applied
    • deformSigma [default: [0]] Given a deformation grid spacing, this determines the standard deviations for each dimension of the random deformation vectors.
    • mirror [default: [0]] Activate random mirroring along the specified axes during training
    • gaussiannoise [default: False] Random multiplicative Gaussian noise on the input data with given std and mean 1
    • scaling [default: [0]] Amount to randomly scale images by, per dimension or for all dimensions, as a factor (e.g. 1.25)
    • rotation [default: 0] Amount in radians to randomly rotate the input around a randomly drawn vector
    • shift [default: [0]] In order to sample outside of discrete coordinates, this can be set to 1 on the relevant axes
    • vary_mean [default: 0]
    • vary_stddev [default: 0]
    • interpolate_always [default: False] Should we also interpolate when using no deformation grids (forces to use same pathways).
    • deformseed [default: 1234] defines the random seed used for the deformation variables
    • interpolation_order [default: 3] Spline order of the interpolation. Values lower than 3 are: 0: nearest, 1: linear, 2: quadratic.
    • padding_rule [default: constant] Rule for adding values outside the image boundaries. Options are: ‘constant’, ‘nearest’, ‘reflect’ or ‘wrap’.
    • regression [default: False]
    • softlabels [default: False]
    • whiten [default: True] Whiten data to mean 0 and std 1. Disable to skip whitening.
    • whiten_subvolumes [default: False] Whiten subvolumes to mean 0 and std 1 (usually it makes more sense to do so on whole volumes)
    • each_with_labels [default: 0] Force each n-th sample to contain labelled data
    • presize_for_normalization [default: [None]] Supply fixed sizes for the calculation of mean and stddev (only suitable with option whiten set)
    • half_gaussian_clip [default: False]
    • pyramid_sampling [default: False]
    • choose_mask_at_random [default: False] if multiple masks are provided, we select one at random for each sample
    • zero_out_label [default: None]
    • lazy [default: True] Load values lazily. Disable to load values non-lazily.
    • perform_one_hot_encoding [default: True] One-hot encode the target. Disable to skip one-hot encoding.
    • minlabel [default: 1] Minimum label to count for each_with_labels functionality
    • channels_first [default: False]
    • preloadall [default: False]
    • truncated_deform [default: False] Limit deformation displacements to at most 3 times gausssigma in each spatial direction
    • connected_components [default: False] return labels of connected components for each pixel belonging to a component instead of its label. Only works for binary segmentation and if no one hot encoding is used (with pytorch).
    • ignore_missing_mask [default: False]
    • save_as [default: None] determines the format of the output images / volumes. Must be either .nii.gz for nifti format, .mhd for MHD output, .png .jpeg or any other common 2d image type in the case of 2d data and .raw for a simple data dump. By default it will try to infer the format, otherwise it will use the nifti format as .nii.gz.
  • w (list) – subvolume/patchsize
  • p (list) – amount of padding per dimension.
  • location (str, optional) – Root folder where samples defined by featurefiles and maskfiles lie. Needs to be provided if tps is not.
  • tps (list, optional) – List of locations or samples defined by featurefiles and maskfiles. Needs to be provided if location is not.
_defaults = {'channels_first': False, 'choose_mask_at_random': {'value': False, 'help': 'if multiple masks are provided, we select one at random for each sample'}, 'connected_components': {'value': False, 'help': 'return labels of connected components for each pixel belonging to a component instead of its label. Only works for binary segmentation and if no one hot encoding is used (with pytorch).'}, 'correct_orientation': {'value': True, 'invert_meaning': 'dont_', 'help': 'Do not correct for the nifti orientation (for example, if header information cannot be trusted but all data arrays are correctly aligned'}, 'deform': {'value': [0], 'help': 'Deformation grid spacing in pixels. If zero, no deformation will be applied', 'type': <class 'int'>}, 'deformSigma': {'value': [0], 'help': 'Given a deformation grid spacing, this determines the standard deviations for each dimension of the random deformation vectors.', 'type': <class 'float'>}, 'deformseed': {'value': 1234, 'help': 'defines the random seed used for the deformation variables', 'type': <class 'int'>}, 'each_with_labels': {'value': 0, 'type': <class 'int'>, 'help': 'Force each n-th sample to contain labelled data'}, 'featurefiles': {'help': 'Filenames of featurefiles.', 'nargs': '+', 'short': 'f'}, 'gaussiannoise': {'value': False, 'help': 'Random multiplicative Gaussian noise on the input data with given std and mean 1'}, 'half_gaussian_clip': False, 'ignore_missing_mask': False, 'interpolate_always': {'value': False, 'help': 'Should we also interpolate when using no deformation grids (forces to use same pathways).'}, 'interpolation_order': {'value': 3, 'help': 'Spline order interpolation. 
Values lower than 3 are: 0: nearest, 1: linear, 2: cubic.'}, 'lazy': {'value': True, 'help': 'Do not load values lazily', 'invert_meaning': 'non'}, 'maskfiles': {'value': [], 'help': 'Filenames of mask file(s) to be used as reference', 'short': 'm', 'nargs': '+'}, 'minlabel': {'value': 1, 'type': <class 'int'>, 'help': 'Minimum label to count for each_with_label functionality'}, 'mirror': {'value': [0], 'help': 'Activate random mirroring along the specified axes during training', 'type': <class 'bool'>}, 'nooriginal': {'value': False, 'help': 'Do not use original data, only gauss filtered'}, 'padding_rule': {'value': 'constant', 'help': 'Rule on how to add values outside the image boundaries. options are: (‘constant’, ‘nearest’, ‘reflect’ or ‘wrap’'}, 'perform_one_hot_encoding': {'value': True, 'help': 'Do not one hot encode target', 'invert_meaning': 'dont_'}, 'preloadall': False, 'presize_for_normalization': {'value': [None], 'help': 'Supply fixed sizes for the calculation of mean and stddev (only suitable with option whiten set)'}, 'pyramid_sampling': False, 'regression': False, 'rotation': {'value': 0, 'help': 'Amount in radians to randomly rotate the input around a randomly drawn vector', 'type': <class 'float'>}, 'save_as': {'value': None, 'help': 'determines the format of the output images / volumes. Must be either .nii.gz for nifti format, .mhd for MHD output, .png .jpeg or any other common 2d image type in the case of 2d data and .raw for a simple data dump. By default it will try to infer the format, otherwise it will use the nifti format as .nii.gz.'}, 'scaling': {'value': [0], 'help': 'Amount ot randomly scale images, per dimension, or for all dimensions, as a factor (e.g. 
1.25)', 'type': <class 'float'>}, 'shift': {'value': [0], 'help': 'In order to sample outside of discrete coordinates, this can be set to 1 on the relevant axes', 'type': <class 'float'>}, 'softlabels': False, 'subtractGaussSigma': {'value': [5], 'type': <class 'int'>, 'help': 'Standard deviations to use for gaussian filtered image during highpass filtering data augmentation step. No arguments deactivates the feature. Can have 1 or nfeatures entries', 'nargs': '*'}, 'truncated_deform': {'value': False, 'help': 'deformations with displacements of maximum 3 times gausssigma in each spatial direction'}, 'vary_mean': 0, 'vary_stddev': 0, 'whiten': {'value': True, 'invert_meaning': 'dont_', 'help': 'Dont whiten data to mean 0 and std 1.'}, 'whiten_subvolumes': {'value': False, 'help': 'Whiten subvolumes to mean 0 and std 1 (usually it makes more sense to do so on whole volumes)'}, 'zero_out_label': None}
_extract_sample(features, masks, imin, imax, shapev, needslabels=False, one_hot=True)[source]

Returns the extracted features and mask(s) for one sample in the batch. The output has shape [wx,wy,wz,f],[wx,wy,wz,c], where wx, wy and wz are the subvolume size and f and c are the numbers of features and classes, respectively. Leave one_hot enabled if one-hot encoding is used; if no one-hot encoded data is used at all, the flag can be set to False.

Parameters:
  • features (ndarray) – input data of full image
  • masks (ndarray) – respective label maps for the full sample / patient / timepoint
  • imin (list) – list of starting indices per dimension for the subvolume / patch to be extracted
  • imax (list) – list of stopping indices per dimension for the subvolume / patch to be extracted
  • shapev (list) – list defining the shape of features and mask
  • needslabels (bool) – If set, will return an empty list if no labels are defined in the resulting subvolume / patch, forcing the calling method to call extract sample on a new location
  • one_hot (bool) – Defines if we return the label data as one hot vectors per voxel / pixel or as label index per voxel / pixel
Returns:

  • tuple of extracted data and labels corresponding to the patch / subvolume defined above and the chosen data augmentation scheme

_get_deform_field_dm()[source]

Helper function to get deformation field. First we define a low resolution deformation field, where we sample randomly from $N(0,I deformSigma)$ at each point in the grid. We then use cubic interpolation to upsample the deformation field to our resolution.

Returns:Deformation field which will be applied to the regular sampling coordinate ndarray
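
The two steps described above (a coarse random field, then cubic upsampling) can be sketched with scipy.ndimage.zoom at order=3. This is an illustration under stated assumptions and not the mdgru code: the function name, the way the coarse grid size is derived from the spacing, and the use of a single scalar sigma for all dimensions are all hypothetical.

```python
import numpy as np
from scipy import ndimage


def deform_field_sketch(shape, spacing, sigma, rng):
    """Hypothetical sketch: coarse Gaussian displacements upsampled with cubic splines."""
    # Assumption: one coarse grid node every `spacing` pixels, plus one border node.
    coarse_shape = [int(np.ceil(s / spacing)) + 1 for s in shape]
    field = np.empty((len(shape), *shape), dtype=np.float64)
    for d in range(len(shape)):
        # Draw one displacement component per coarse grid node.
        coarse = rng.normal(0.0, sigma, size=coarse_shape)
        zoom = [s / c for s, c in zip(shape, coarse_shape)]
        # order=3 selects cubic spline interpolation.
        field[d] = ndimage.zoom(coarse, zoom, order=3)
    return field


rng = np.random.RandomState(1234)
field = deform_field_sketch((64, 64), spacing=16, sigma=2.0, rng=rng)
print(field.shape)
```

The result holds one displacement component per spatial dimension and would be added to the regular sampling coordinates before interpolation.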
_get_features_and_masks(folder, featurefiles=None, maskfiles=None)[source]

Returns all feature and mask files for the sample in folder.

Parameters:
  • folder (str) – location of sample
  • featurefiles (list of str, optional) – featurefiles to return
  • maskfiles (list of str, optional) – maskfiles to return

Returns:tuple of feature and mask ndarrays
_one_hot_vectorize(indexlabels, nclasses=None, zero_out_label=None)

Simplified one-hot label method. Interpolated labels are discouraged, so this only accepts integer values in indexlabels.

Parameters:
  • indexlabels (ndarray) – array of class labels or indices, ranging from 0 to nclasses-1
  • nclasses (int) – number of classes
  • zero_out_label (int) – label to assign probability of zero for the whole probability distribution
Returns:

ndarray – Probability distributions per pixel, where the entry at position indexlabels is set to 1 and all others to 0

_rotate(affine)[source]

Helper function to rotate an affine matrix

affine = array([[1., 0., 0., 0.], [0., 1., 0., 0.], [0., 0., 1., 0.], [0., 0., 0., 1.]])
get_all_tps(folder, featurefiles, maskfiles)

Computes a list of all subfolders of folder that contain all provided featurefiles and maskfiles.

Parameters:
  • folder (str) – location at which timepoints are searched
  • featurefiles (list of str) – necessary featurefiles to be contained in a timepoint
  • maskfiles (list of str) – necessary maskfiles to be contained in a timepoint
Returns:

sorted list – valid timepoints in string format

get_data_dims()[source]

Returns the shape of all available data concatenated in the batch dimension

Returns:list (shape of all input data)
get_shape()[source]

Returns the shape of the input data (with the batchsize set to None)

Returns:list (shape of input data)
get_states()[source]

Get the states of all involved random generators

Returns:states of random generators
get_target_shape()[source]

Returns the shape of the target data

Returns:list (shape of target data)
get_volume_batch_generators()[source]

Helper method returning a generator to efficiently fully sample a test volume on a predefined grid given w and p

Returns:Generator which completely covers the data for each sample in tps in a way defined by w and p
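
One way to picture "fully sampling a volume on a predefined grid given w and p" is a generator over patch index ranges. The sketch below is not the library's generator. It assumes that patches advance with stride w and that each read is extended by p on both sides, which is only one plausible reading of the parameters. The name grid_patches is hypothetical.

```python
import itertools


def grid_patches(volume_shape, w, p):
    """Hypothetical sketch: yield (inner, outer) index ranges that tile a volume."""
    axes = []
    for size, wi in zip(volume_shape, w):
        # Assumption: inner patches advance with stride w and are clipped at the border.
        axes.append([(start, min(start + wi, size)) for start in range(0, size, wi)])
    for inner in itertools.product(*axes):
        # Assumption: the region actually read is widened by p on both sides.
        outer = tuple((lo - pi, hi + pi) for (lo, hi), pi in zip(inner, p))
        yield inner, outer


patches = list(grid_patches((100, 70), w=[64, 64], p=[5, 5]))
print(len(patches))
```

Outer ranges may reach beyond the volume; in this sketch the caller would be responsible for padding those regions.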
labellist = None
load(file, lazy=True)[source]

Handles all data loading from disk. If new filetypes should be allowed, this has to be implemented here.

Parameters:
  • file (str) – file path of the image / volume to load or folder of the images to load as volume.
  • lazy (bool) – If set to False, all files are kept in memory once they are loaded.
Returns:

image data

pixdim = array([1, 1, 1, 1, 1, 1, 1])
preload_all()[source]

Greedily loads all images into memory

random_sample(batch_size=1, dtype=None, tp=None, **kw)[source]

Randomly samples batch_size times from the data, using data augmentation if specified when creating the class

Parameters:
  • batch_size – number of samples to draw
  • dtype – datatype to return
  • tp – specific timepoint / patient to sample from
  • kw – options (not used)
Returns:

tuple of samples and corresponding label masks
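
Stripped of augmentation, random sampling amounts to cutting batch_size windows of size w out of a volume and its label map at random positions. The following NumPy sketch shows that core step only. The function name and the array layout (features in the last axis) are assumptions, and none of the augmentation options are modelled.

```python
import numpy as np


def random_patches(volume, labels, w, batch_size=1, rng=None):
    """Hypothetical sketch: draw batch_size random windows of size w."""
    if rng is None:
        rng = np.random.RandomState(1234)
    feats, masks = [], []
    for _ in range(batch_size):
        # Pick a start index per axis so that the window fits inside the data.
        start = [rng.randint(0, s - wi + 1) for s, wi in zip(labels.shape, w)]
        window = tuple(slice(a, a + wi) for a, wi in zip(start, w))
        feats.append(volume[window])   # trailing feature axis is kept
        masks.append(labels[window])
    return np.stack(feats), np.stack(masks)


volume = np.arange(10 * 12 * 2, dtype=np.float32).reshape(10, 12, 2)
labels = np.zeros((10, 12), dtype=np.int64)
feats, masks = random_patches(volume, labels, w=[4, 4], batch_size=3)
print(feats.shape, masks.shape)
```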

reset_seed(seed=12345678)

reset main random number generator with given seed

save(data, filename, tporigin=None)[source]

Saves image in data at location filename. Currently, data can only be saved as nifti or png images.

Parameters:
  • data – ndarray containing the image data
  • filename – location to store the image
  • tporigin – used if the data needs to be stored in the same orientation as the data at tporigin. Only works for nifti files
set_states(states)[source]

Sets states of random generators according to the states in states

Parameters:states – random generator states
subtract_gauss(data)[source]

Subtracts gaussian filtered data from itself

Parameters:data (ndarray) – data to preprocess
Returns:ndarray (gaussian filtered data)
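
Subtracting a Gaussian-blurred copy from the data acts as a high-pass filter. A minimal sketch with scipy.ndimage.gaussian_filter follows. It assumes the result is the input minus its blurred copy and uses a single sigma for all axes. The function name is hypothetical and the library's handling of multiple features or sigmas is not modelled.

```python
import numpy as np
from scipy import ndimage


def subtract_gauss_sketch(data, sigma=5):
    """Hypothetical sketch: high-pass filter by removing the Gaussian-blurred copy."""
    data = np.asarray(data, dtype=np.float64)
    # The blurred copy keeps low frequencies; the difference keeps the rest.
    return data - ndimage.gaussian_filter(data, sigma)


image = np.full((16, 16), 3.0)
highpass = subtract_gauss_sketch(image)
print(np.abs(highpass).max())
```

A constant image has no high-frequency content, so the difference is zero everywhere up to rounding.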
transformAffine(coords)[source]

Transforms coordinates according to the specified data augmentation scheme

Parameters:coords (ndarray) – original, not augmented pixel coordinates of the subvolume / patch
Returns:“augmented” ndarray of coords
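
To make the idea of transforming sampling coordinates concrete, the sketch below rotates and scales a 2D coordinate grid about its centre. It is an illustration and not the library's transformAffine: it covers only 2D, only rotation and isotropic scaling, and the function name and parameters are hypothetical.

```python
import numpy as np


def affine_coords_sketch(coords, angle=0.0, scale=1.0):
    """Hypothetical sketch: rotate and scale 2D coordinates about their centre."""
    coords = np.asarray(coords, dtype=np.float64)  # shape (2, H, W)
    centre = coords.reshape(2, -1).mean(axis=1).reshape(2, 1, 1)
    c, s = np.cos(angle), np.sin(angle)
    matrix = scale * np.array([[c, -s], [s, c]])
    # Apply the 2x2 matrix to every coordinate vector, relative to the centre.
    return np.tensordot(matrix, coords - centre, axes=1) + centre


grid = np.stack(np.meshgrid(np.arange(4), np.arange(4), indexing="ij")).astype(float)
same = affine_coords_sketch(grid)
scaled = affine_coords_sketch(grid, scale=2.0)
print(same.shape)
```

The transformed coordinates would then be passed to an interpolation routine to read the augmented patch from the image.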
w = [64, 64]

ThreadedGridDataCollection

class mdgru.data.grid_collection.ThreadedGridDataCollection(featurefiles, maskfiles=[], location=None, tps=None, kw={})[source]

Bases: mdgru.data.grid_collection.GridDataCollection

Threaded version of GridDataCollection. A thin wrapper that uses num_threads threads to preload random samples. Because the threads run concurrently, the resulting sampling patterns may not be reproducible.

Parameters:
  • kw (dict containing the following options.) –
    • batch_size [default: 1]
    • num_threads [default: 3] Determines how many threads are used to prefetch data, such that io operations do not cause delay.
  • featurefiles (list of str) – filenames of the different features to consider
  • maskfiles (list of str) – filenames of the available mask files per patient
  • location (str, optional) – Location at which all samples containing all of featurefiles and maskfiles lie somewhere in the subfolder structure. Must be provided, if tps is not.
  • tps – paths of all samples to consider. Must be provided if location is not set.
_defaults = {'batch_size': 1, 'num_threads': {'help': 'Determines how many threads are used to prefetch data, such that io operations do not cause delay.', 'value': 3, 'type': <class 'int'>}}
_extract_sample(features, masks, imin, imax, shapev, needslabels=False, one_hot=True)

Returns the extracted features and mask(s) for one sample in the batch. The output has shape [wx,wy,wz,f],[wx,wy,wz,c], where wx, wy and wz are the subvolume size and f and c are the numbers of features and classes, respectively. Leave one_hot enabled if one-hot encoding is used; if no one-hot encoded data is used at all, the flag can be set to False.

Parameters:
  • features (ndarray) – input data of full image
  • masks (ndarray) – respective label maps for the full sample / patient / timepoint
  • imin (list) – list of starting indices per dimension for the subvolume / patch to be extracted
  • imax (list) – list of stopping indices per dimension for the subvolume / patch to be extracted
  • shapev (list) – list defining the shape of features and mask
  • needslabels (bool) – If set, will return an empty list if no labels are defined in the resulting subvolume / patch, forcing the calling method to call extract sample on a new location
  • one_hot (bool) – Defines if we return the label data as one hot vectors per voxel / pixel or as label index per voxel / pixel
Returns:

  • tuple of extracted data and labels corresponding to the patch / subvolume defined above and the chosen data augmentation scheme

_get_deform_field_dm()

Helper function to get deformation field. First we define a low resolution deformation field, where we sample randomly from $N(0,I deformSigma)$ at each point in the grid. We then use cubic interpolation to upsample the deformation field to our resolution.

Returns:Deformation field which will be applied to the regular sampling coordinate ndarray
_get_features_and_masks(folder, featurefiles=None, maskfiles=None)

Returns all feature and mask files for the sample in folder.

Parameters:
  • folder (str) – location of sample
  • featurefiles (list of str, optional) – featurefiles to return
  • maskfiles (list of str, optional) – maskfiles to return

Returns:tuple of feature and mask ndarrays
_one_hot_vectorize(indexlabels, nclasses=None, zero_out_label=None)

Simplified one-hot label method. Interpolated labels are discouraged, so this only accepts integer values in indexlabels.

Parameters:
  • indexlabels (ndarray) – array of class labels or indices, ranging from 0 to nclasses-1
  • nclasses (int) – number of classes
  • zero_out_label (int) – label to assign probability of zero for the whole probability distribution
Returns:

ndarray – Probability distributions per pixel, where the entry at position indexlabels is set to 1 and all others to 0

_preload_random_sample(batchsize, container_id)[source]
_rotate(affine)

Helper function to rotate an affine matrix

affine = array([[1., 0., 0., 0.], [0., 1., 0., 0.], [0., 0., 1., 0.], [0., 0., 0., 1.]])
get_all_tps(folder, featurefiles, maskfiles)

Computes a list of all subfolders of folder that contain all provided featurefiles and maskfiles.

Parameters:
  • folder (str) – location at which timepoints are searched
  • featurefiles (list of str) – necessary featurefiles to be contained in a timepoint
  • maskfiles (list of str) – necessary maskfiles to be contained in a timepoint
Returns:

sorted list – valid timepoints in string format

get_data_dims()

Returns the shape of all available data concatenated in the batch dimension

Returns:list (shape of all input data)
get_shape()

Returns the shape of the input data (with the batchsize set to None)

Returns:list (shape of input data)
get_states()

Get the states of all involved random generators

Returns:states of random generators
get_target_shape()

Returns the shape of the target data

Returns:list (shape of target data)
get_volume_batch_generators()

Helper method returning a generator to efficiently fully sample a test volume on a predefined grid given w and p

Returns:Generator which completely covers the data for each sample in tps in a way defined by w and p
labellist = None
load(file, lazy=True)

Handles all data loading from disk. If new filetypes should be allowed, this has to be implemented here.

Parameters:
  • file (str) – file path of the image / volume to load or folder of the images to load as volume.
  • lazy (bool) – If set to False, all files are kept in memory once they are loaded.
Returns:

image data

pixdim = array([1, 1, 1, 1, 1, 1, 1])
preload_all()

Greedily loads all images into memory

random_sample(batch_size=1, dtype=None, tp=None, **kw)[source]

Thin wrapper around GridDataCollection's random_sample that uses multiple threads to do the heavy lifting.

Parameters:
  • batch_size (int) – batch size. If this value differs from the one previously provided to the threads, the preloaded data is discarded and new samples are computed with the new batch size.
  • dtype – dtype of the returned input data
  • tp (str) – specific timepoint to load
  • kw – options, not used at the moment
Returns:

Tuple of randomly sampled and possibly deformed / augmented input data and corresponding labels
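
The prefetching idea behind this wrapper can be sketched with the standard library's threading and queue modules. The example is not the mdgru implementation: start_prefetch, fake_sample and the bounded queue are hypothetical, and the real class additionally handles changes of batch_size between calls.

```python
import queue
import threading


def start_prefetch(sample_fn, batch_size, num_threads=3, capacity=8):
    """Hypothetical sketch: worker threads fill a bounded queue with batches."""
    buffer = queue.Queue(maxsize=capacity)
    stop = threading.Event()

    def worker():
        while not stop.is_set():
            batch = sample_fn(batch_size)
            # Retry until there is room in the queue or a stop is requested.
            while not stop.is_set():
                try:
                    buffer.put(batch, timeout=0.1)
                    break
                except queue.Full:
                    continue

    threads = [threading.Thread(target=worker, daemon=True) for _ in range(num_threads)]
    for thread in threads:
        thread.start()
    return buffer, stop, threads


def fake_sample(batch_size):
    # Placeholder for an expensive sampling call such as random_sample.
    return [0] * batch_size


buffer, stop, threads = start_prefetch(fake_sample, batch_size=2)
batch = buffer.get(timeout=5)  # the consumer takes whichever batch is ready first
stop.set()
for thread in threads:
    thread.join(timeout=1)
print(batch)
```

Because several workers push into the same queue, the order in which batches arrive depends on thread scheduling, which is why such a setup gives up reproducible sampling.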

reset_seed(seed=12345678)

reset main random number generator with given seed

save(data, filename, tporigin=None)

Saves image in data at location filename. Currently, data can only be saved as nifti or png images.

Parameters:
  • data – ndarray containing the image data
  • filename – location to store the image
  • tporigin – used if the data needs to be stored in the same orientation as the data at tporigin. Only works for nifti files
set_states(states)

Sets states of random generators according to the states in states

Parameters:states – random generator states
subtract_gauss(data)

Subtracts gaussian filtered data from itself

Parameters:data (ndarray) – data to preprocess
Returns:ndarray (gaussian filtered data)
transformAffine(coords)

Transforms coordinates according to the specified data augmentation scheme

Parameters:coords (ndarray) – original, not augmented pixel coordinates of the subvolume / patch
Returns:“augmented” ndarray of coords
w = [64, 64]