gapfill

Fill gaps of flux data from Eddy covariance measurements or estimate flux uncertainties

This module was written by Matthias Cuntz while at Department of Computational Hydrosystems, Helmholtz Centre for Environmental Research - UFZ, Leipzig, Germany, and continued while at Institut National de Recherche pour l’Agriculture, l’Alimentation et l’Environnement (INRAE), Nancy, France.

copyright:

Copyright 2012-2022 Matthias Cuntz, see AUTHORS.rst for details.

license:

MIT License, see LICENSE for details.

The following functions are provided

gapfill(dfin[, flag, date, timeformat, ...])

Fill gaps of flux data from Eddy covariance measurements or estimate flux uncertainties

History
  • Written Mar 2012 by Matthias Cuntz - mc (at) macu (dot) de

  • Ported to Python 3, Feb 2013, Matthias Cuntz

  • Input data can be ND-array, Apr 2014, Matthias Cuntz

  • Bug in longestmarginalgap: was only working at time series edges, rename it to longgap, Apr 2014, Matthias Cuntz

  • Keyword fullday, Apr 2014, Matthias Cuntz

  • Input can be pandas Dataframe or numpy array(s), Apr 2020, Matthias Cuntz

  • Using numpy docstring format, May 2020, Matthias Cuntz

  • Error estimates are undef by default, Jun 2021, Matthias Cuntz

  • Mean of values for error estimates, Jun 2021, Matthias Cuntz

  • Improved flake8 and numpy docstring, Oct 2021, Matthias Cuntz

gapfill(dfin, flag=None, date=None, timeformat='%Y-%m-%d %H:%M:%S', colhead=None, sw_dev=50.0, ta_dev=2.5, vpd_dev=5.0, longgap=60, fullday=False, undef=-9999, ddof=1, err=False, errmean=False, verbose=0)[source]

Fill gaps of flux data from Eddy covariance measurements or estimate flux uncertainties

Fills gaps in flux data from Eddy covariance measurements with Marginal Distribution Sampling (MDS) according to Reichstein et al. (Global Change Biology, 2005).

This means, if there is a gap in the data, look for similar meteorological conditions (defined as maximum possible deviations) in a certain time window and fill with the average of these ‘similar’ values.

The routine can also do the same search for similar meteorological conditions for every data point and calculate its standard deviation as a measure of uncertainty after Lasslop et al. (Biogeosciences, 2008).

Parameters:
  • dfin (pandas.Dataframe or numpy.array) –

    time series of fluxes to fill as well as meteorological variables: incoming short-wave radiation, air temperature, and air vapour pressure deficit. dfin can be a pandas.Dataframe with the columns

    ’SW_IN’ (or starting with ‘SW_IN’) for incoming short-wave radiation [W m-2]

    ’TA’ (or starting with TA_) for air temperature [deg C]

    ’VPD’ (or starting with ‘VPD’) for air vapour deficit [hPa]

    as well as columns with ecosystem fluxes with missing values (gaps).

    The index is taken as date variable.

    dfin can also me a numpy array with the same columns. In this case colhead, date, and possibly dateformat must be given.

  • flag (pandas.Dataframe or numpy.array, optional) – Dataframe or array has the same shape as dfin. Non-zero values in flag will be treated as missing values in dfin. flag must follow the same rules as dfin if pandas.Dataframe. If flag is numpy array, df.columns.values will be used as column heads and the index of dfin will be copied to flag.

  • date (array_like of string, optional) – 1D-array_like of calendar dates in format given in timeformat. date must be given if dfin is numpy array.

  • timeformat (str, optional) – Format of dates in date, if given (default: ‘%Y-%m-%d %H:%M:%S’). See strftime documentation of Python’s datetime module: https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior

  • colhed (array_like of str, optional) – column names if dfin is numpy array. See dfin for mandatory column names.

  • sw_dev (float, optional) – threshold for maximum deviation of global radiation (default: 50)

  • ta_dev (float, optional) – threshold for maximum deviation of air Temperature (default: 2.5)

  • vpd_dev (float, optional) – threshold for maximum deviation of vpd (default: 5)

  • longgap (int, optional) – avoid extrapolation into a gap longer than longgap days (default: 60)

  • fullday (bool, optional) – True: move beginning of large gap to start of next day and move end of large gap to end of last day (default: False)

  • undef (float, optional) – values having undef value are treated as missing values in dfin (default: -9999). np.nan is not allowed as undef (not working).

  • ddof (int, optional) – Delta Degrees of Freedom. The divisor used in calculation of standard deviation for error estimates (err=True) is N-ddof, where N represents the number of elements (default: 1).

  • err (bool, optional) – True: fill every data point with standard deviation instead of mean, i.e. used for error generation as in Lasslop et al. (Biogeosci 2008) (default: False)

  • errmean (bool, optional) – True: also return mean value of values for error estimates if err == True (default: False)

  • shape (bool or tuple, optional) –

    True: output have the same shape as input data if dfin is numpy array; if a tuple is given, then this tuple is used to reshape.

    False: outputs are 1D arrays if dfin is numpy array (default: False).

  • verbose (int, optional) – Verbosity level 0-3 (default: 0). 0 is no output; 3 is very verbose.

Returns:

if not err: filled_data, quality_class

if err and not errmean: err_estimate

if err and errmean: err_estimate, mean_estimate

pandas.Dataframe(s) will be returned if dfin was Dataframe.

numpy array(s) will be returned if dfin was numpy array.

Return type:

pandas.Dataframe(s) or numpy array(s)

Notes

If err, there is no error estimate if there are no meteorological conditions in the vicinity of the data point (first cycle of Reichstein et al. GCB 2005).

Routine does not work with undef=np.nan.

Reichstein et al. (2005)

On the separation of net ecosystem exchange into assimilation and ecosystem respiration: review and improved algorithm Global Change Biology 11, 1424-1439

Lasslop et al. (2008)

Influences of observation errors in eddy flux data on inverse model parameter estimation Biogeosciences, 5, 1311–1324

Examples

>>> import numpy as np
>>> from fread import fread
>>> from date2dec import date2dec
>>> from dec2date import dec2date

data

>>> ifile = 'test_gapfill.csv' # Tharandt 1998 = Online tool example file
>>> undef = -9999.
>>> dat   = fread(ifile, skip=2, transpose=True)
>>> ndat  = dat.shape[1]
>>> head  = fread(ifile, skip=2, header=True)
>>> head1 = head[0]
>>> idx   = []
>>> for i in head1:
...     if i in ['NEE', 'LE', 'H', 'Rg', 'Tair', 'VPD']:
...         idx.append(head1.index(i))
>>> colhead = ['FC', 'LE', 'H', 'SW_IN', 'TA', 'VPD']
>>> dfin = dat[idx,:]

flag

>>> flag = np.where(dfin == undef, 2, 0)
>>> flag[0, :] = dat[head1.index('qcNEE'), :].astype(int)
>>> flag[1, :] = dat[head1.index('qcLE'), :].astype(int)
>>> flag[2, :] = dat[head1.index('qcH'), :].astype(int)
>>> flag[np.where(flag==1)] = 0

date

>>> day_id  = head1.index('Day')
>>> hour_id = head1.index('Hour')
>>> ntime   = dat.shape[1]
>>> year  = np.ones(ntime, dtype=int) * 1998
>>> hh    = dat[hour_id, :].astype(int)
>>> mn    = np.rint((dat[hour_id,:] - hh) * 60.).astype(int)
>>> y0    = date2dec(yr=year[0], mo=1, dy=1, hr=hh, mi=mn)
>>> jdate = y0 + dat[day_id, :]
>>> adate = dec2date(jdate, eng=True)

fill missing data

>>> dat_f, flag_f = gapfill(dfin, flag=flag, date=adate, colhead=colhead,
...                         undef=undef, verbose=0)
>>> print('{:d} {:d} {:d} {:d} {:d} {:d}'.format(*flag_f[0, 11006:11012]))
1 1 1 2 2 2
>>> print('{:.2f} {:.2f} {:.2f} {:.2f} {:.2f} {:.2f}'.format(
...       *dat_f[0, 11006:11012]))
-18.68 -15.63 -19.61 -15.54 -12.40 -15.33

1D err

>>> dat_std = gapfill(dfin, flag=flag, date=adate, colhead=colhead,
...                   undef=undef, verbose=0, err=True)
>>> print('{:.3f} {:.3f} {:.3f} {:.3f} {:.3f} {:.3f}'.format(
...       *dat_std[0, 11006:11012]))
5.372 13.118 6.477 -9999.000 -9999.000 -9999.000
>>> dat_err = np.ones(ndat, dtype=int)*(-1)
>>> kk      = np.where((dat_std[0, :] != undef) & (dat_f[0, :] != 0.))[0]
>>> dat_err[kk] = np.abs(dat_std[0,kk]/dat_f[0,kk]*100.).astype(int)
>>> print('{:d} {:d} {:d} {:d} {:d} {:d}'.format(*dat_err[11006:11012]))
28 83 33 -1 -1 -1

1D err + mean

>>> dat_std, dat_mean = gapfill(dfin, flag=flag, date=adate,
...                             colhead=colhead, undef=undef, verbose=0,
...                             err=True, errmean=True)
>>> print('{:.3f} {:.3f} {:.3f} {:.3f} {:.3f} {:.3f}'.format(
...       *dat_std[0, 11006:11012]))
5.372 13.118 6.477 -9999.000 -9999.000 -9999.000
>>> print('{:.3f} {:.3f} {:.3f} {:.3f} {:.3f} {:.3f}'.format(
...       *dat_mean[0, 11006:11012]))
-18.677 -15.633 -19.610 -9999.000 -9999.000 -9999.000