tsa

The tsa (short for time series analysis) module is centered around the time_series class. One or more time_series objects should be central to any data analysis task that examines temporal relationships in data sets of raster or tabular format. This module also houses the rast_series class, which is an extension of time_series for handling filepaths to raster data.

Examples

time_series basics

Users with text data or sequential raster images with identical spatial extents may wish to perform time-specific operations on that data set. A time_series object makes it easy to perform tasks like subsetting, plotting, sorting, taking statistics, interpolating between values, sanitizing data for bad entries, and more.

The use case below is an example using weather data downloaded from this NOAA website. First, take a look at the format of this sample weather data, and take special note of the column labeled "YR--MODAHRMN" and its format.

We want to parse this weather data and perform a variety of manipulations on it, but first we have to get it into Python. To see how this is done, open up your preferred interpreter (IDLE by default) and retype this code step by step.

from dnppy import textio

filepath = "test_data/weather_dat.txt"       # define the text filepath
tdo      = textio.read_DS3505(filepath)      # build a text data object

print(tdo.headers)                           # print the headers
print(tdo.row_data[0])                       # print the first row

We now have a text_data object. To read more about text_data objects, check out the textio module. To turn this tdo into a time_series, we can do

from dnppy import tsa                        # import the tsa module

ts = tsa.time_series('weather_data')
ts.from_tdo(tdo)                             # use contents of the tdo

print(ts.headers)                            # print the headers
print(ts.row_data[0])                        # print the first row

We can see that similar headers and the same row data can be found in the time_series object ts. Next we need to tell Python how to interpret this data to assign times to each row. dnppy does this with Python datetime objects from the native datetime module. The actual operations are hidden from the user, but we need to use datetime syntax to tell it how our dates are formatted, like so:

timecol = "YR--MODAHRMN"                        # time data column header
fmt     = "%Y%m%d%H%M"                          # datetime format

ts.define_time(timecol, fmt)                    # interpret strings
ts.interrogate()                                # print a heads up summary

As we can see above, we are telling the time_series object ts to execute its internal define_time method with arguments specifying which column contains the time information and the fmt string to use to interpret it. This specific fmt string is used to read strings such as 201307180054, of format YYYYMMDDHHMM. Read more on datetime formatting syntax in the Python datetime documentation.
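As a quick check outside dnppy, the same fmt string can be handed straight to the standard datetime module. This is only a sketch of the kind of parsing define_time is described as performing; the variable names are illustrative.

```python
from datetime import datetime

fmt   = "%Y%m%d%H%M"        # year, month, day, hour, minute
stamp = "201307180054"      # sample value from the YR--MODAHRMN column

parsed = datetime.strptime(stamp, fmt)
print(parsed)               # 2013-07-18 00:54:00
```

If the format string and the data disagree, strptime raises a ValueError, which is a handy way to test a fmt string before applying it to a whole column.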

Now we can do cool stuff, like split it into subsets.

ts.make_subsets("%d")     # subset the data into daily chunks
ts.interrogate()          # print a heads up summary

As you can see from the report, there are now many “subsets” of this time series. Each of these subsets is actually its own time_series object, and all the same manipulations can be performed on them individually as can be performed on the whole series. Let’s go ahead and do a quick plot of the temperature data in here.

ts.column_plot("TEMP")  # no frills plot of temperature

Notice that we used the column name for temperature data from our weather file. This doesn’t make very pretty plots, so if we want, we can rename parameters. Let’s say July 21st is the most interesting day and we want to see temperature and dewpoint next to each other, but we also want our plot to be a little prettier.

ts.rename_header("TEMP","Temperature")          # give header better name
ts.rename_header("DEWP","Dewpoint")             # give header better name

jul21 = ts["2013-07-21"]                        # pull out subset july 21st

jul21.column_plot(["Temperature","Dewpoint"],   # make plot with labels
        title = "Temperature and Dewpoint",
        xlabel = "Date and Time",
        ylabel = "Degrees F")

This is much better, but now we decide that we don’t just want July 21st, but also the days on either side of it. To do this, we can use an overlap_width in our subsetting command to make more of a moving window through this time series, centered around July 21st.

ts.make_subsets("%d", overlap_width = 1,       # subset with 1 overlap
                discard_old = True)            # discard the old subsets

ts.interrogate()                               # print a heads up summary

jul21 = ts["2013-07-21"]                       # pull july 21st subset
jul21.column_plot(["Temperature","Dewpoint"],  # save the plot this time
        title = "Temperature and Dewpoint",
        ylabel = "Degrees F",
        save_path = "test.png")

Now we actually have three days in our jul21 time series, and we are happy with this. So far, this has only introduced you to about a third of the functionality available in the tsa module, but it should be enough to get you started. Consult the internal help docs and function list below to learn more!
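To make the effect of overlap_width concrete, the sketch below lists which calendar days a daily subset would cover for a given overlap. It is not dnppy code; window_days is a hypothetical helper that only illustrates the "center day plus overlap_width days on each side" idea.

```python
from datetime import date, timedelta

def window_days(center, overlap_width):
    """Return the dates covered by a daily subset centered on `center`."""
    return [center + timedelta(days=offset)
            for offset in range(-overlap_width, overlap_width + 1)]

days = window_days(date(2013, 7, 21), overlap_width=1)
print([d.isoformat() for d in days])
# ['2013-07-20', '2013-07-21', '2013-07-22']
```

The window always holds 2 * overlap_width + 1 days, which is why an overlap of 1 gives the three days seen above.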

rast_series basics

The rast_series class is a child of time_series class, where each row in rast_series.row_data contains the values [filepath, filename, fmt_name] for a raster image.

There are some cool functions in the raster module that are tailored to time series type raster images, and the rast_series is a type of container to manage your raster data and pass it into more complex raster functions. This example uses data that follows on from the MODIS extracting and mosaicing example.

A rast_series object can be created and populated with

from dnppy import tsa
rs = tsa.rast_series()

rastdir  = r"C:\Users\jwely\mosaics"  # directory of our MODIS mosaics
fmt      = "%Y%j"
fmt_mask = range(9,16)

rs.from_directory(rastdir, fmt, fmt_mask)

The rastdir variable in this example is the same place our MODIS mosaics are stored from the MODIS mosaic example. The fmt variable should be familiar to you from the time_series example above, but the fmt_mask is new. The fmt_mask is simply a list of character locations within the filename strings where the information matching fmt can be found. Our filenames look something like MOD10A1.A2015001.mosaic.005.2015006011526_FracSnowCover.tif, and fmt_mask tells the rast_series that characters 9 through 15 are where the date information can be found. Remember that Python counts up from zero.
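The character positions can be checked with plain Python before building the rast_series. The sketch below only mimics the idea of pulling the masked characters out of a filename and parsing them with fmt; how dnppy does this internally is not shown here.

```python
from datetime import datetime

filename = "MOD10A1.A2015001.mosaic.005.2015006011526_FracSnowCover.tif"
fmt      = "%Y%j"            # four digit year, then day of year
fmt_mask = range(9, 16)      # zero-based positions 9 through 15

datestring = "".join(filename[i] for i in fmt_mask)
print(datestring)                            # 2015001
print(datetime.strptime(datestring, fmt))    # 2015-01-01 00:00:00
```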

For this example, we want to remove all values that are not telling us about the fractional snow cover on the ground. Since these values are represented by integers 1 through 100, we can set all other values to NoData with

rs.null_set_range(high_thresh = 100, NoData_Value = 101)

This simply calls raster.null_set_range on every image in this rast_series, sets all numbers over 100 to equal 101, and sets 101 as the NoData_Value for these images. Now let’s say we want rolling average type statistics, with a window 7 days wide to represent a week of data centered around any given day. We can divide the data into week-long chunks (with overlap) in the same way we do with time_series.

rs.make_subsets("%d", overlap_width = 3)
rs.interrogate()

An overlap width of three means grabbing three days on each side of the center day. Now we can take the statistics with

statsdir = r"C:\Users\jwely\statistics"   # directory to save our statistics
rs.series_stats(statsdir)

This calls the raster.many_stats function on every subset in this rast_series object. Now we have weekly snapshots for each day in our data record!
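For intuition about what per-subset raster statistics involve, the sketch below computes the four default outputs listed for series_stats (AVG, NUM, STD, SUM) for a single pixel through a small stack of grids, skipping the NoData value. It is a pure Python illustration, not the raster.many_stats implementation; pixel_stats, the toy grids, and the choice of population standard deviation are all assumptions.

```python
from statistics import mean, pstdev

NODATA = 101   # NoData value used in the example above

# three tiny 2x2 "rasters" standing in for one week-long subset
stack = [
    [[10, 101], [50, 0]],
    [[20, 101], [60, 0]],
    [[30,  40], [101, 0]],
]

def pixel_stats(stack, row, col):
    """Statistics of one pixel through the stack, skipping NoData."""
    values = [grid[row][col] for grid in stack if grid[row][col] != NODATA]
    return {"AVG": mean(values), "NUM": len(values),
            "STD": pstdev(values), "SUM": sum(values)}

stats = pixel_stats(stack, 0, 0)
print(stats["AVG"], stats["NUM"], stats["SUM"])   # 20 3 60
```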

Code Help

Auto-documentation for functions and classes within this module is generated below!

class time_series(name='name', units=None, subsetted=False, disc_level=0, parent=None)

A subsettable time series object

The primary motivation for creating this object was to allow a time series to be subsetted into any number of small chunks but retain the ability to process and interrogate the time series at any level with the exact same external syntax.

A time series object is composed of a matrix of data, and may contain an object list of subset time_series objects. Potentially unlimited nesting of time series datasets is possible. For example, a year’s worth of hourly data may be subsetted into 1-month time series, while each of those is in turn subsetted into days. The highest level time series will still allow operations to be performed upon it.

All internal methods are built to handle this flexible definition of a time series, where the steps of the method depend on whether the time series is at its smallest subset or not.

MEMORY WARNING. The entirety of the dataset is represented in every layer of subsetting, so watch out for exploding memory consumption by excessive subsetting of gigantic datasets.

_build_time(time_header, fmt, start_date=False)

This internal use function is called twice by “define_time”. Once to turn all the datestamps into datetime objects, then a second time once the entire dataset has been sorted in ascending time order by those datetime objects.

This is to ensure all time values are in terms of the correct start time, which can be no later than the earliest entry in the dataset.

Parameters:
  • time_header – name of column with time data in it
  • fmt – the fmt string to interpret time data into datetime objects
  • start_date – The date to count up from.

static _center_datetime(datetime_obj, units)

Returns datetime obj that is centered on the “unit” of the input datetime obj

When grouping datetimes together, center times are important. This function allows a center time with units equal to the user’s input (years, months, days, ...) to be generated from the first datetime of the time series.

Parameters:
  • datetime_obj – any datetime object
  • units – units by which to center input datetime object
Return center_datetime:
  returns centered datetime object.

_extract_time(time_header)

special case of “extract_column” method for time domain.

static _fmt_to_units(in_fmt)

converts fmt strings to unit names of associated datetime attributes

Parameters:
  • in_fmt – datetime object style unit characters like %Y or %m
Return units:
  english version of input format. Example: %Y -> “year”

_get_atts_from(parent_time_series)

Allows bulk setting of attributes. Useful for allowing a subset to inherit information from its parent time_series

Parameters:
  • parent_time_series – a time_series object from which to inherit

_name_as_subset(binned=False)

uses time series object to descriptively name itself. Naming subsets as bins will name them based only on smallest unit of discretization.

Parameters:
  • binned – set to True to name subsets as bins

static _seconds_to_units(seconds, units)

converts seconds to other time units

Parameters:
  • seconds – number of seconds
  • units – units to convert those seconds to
Returns:
  time equivalent of input seconds expressed as input units

static _units_to_fmt(units)

converts unit names to fmt strings used by datetime.strftime.

Parameters:
  • units – units are strings like “year”, “month”, “hour”
Return fmt:
  returns “fmt” equivalent of english units

_units_to_seconds(units, dto=None)

converts other time units to seconds

Parameters:
  • units – some english unit such as “hour”, “day”, etc.
Return seconds:
  the number of seconds in an input unit
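The two unit conversion helpers above can be pictured with a small lookup table. This is a hedged sketch rather than the dnppy source: it covers only fixed-length units, and the function and table names are hypothetical. Months and years are left out because their length depends on the date in question.

```python
# fixed-length units only; months and years vary in length
SECONDS_PER_UNIT = {"second": 1, "minute": 60, "hour": 3600, "day": 86400}

def units_to_seconds(units):
    """Number of seconds in one of the given unit."""
    return SECONDS_PER_UNIT[units]

def seconds_to_units(seconds, units):
    """Express a number of seconds in the given unit."""
    return seconds / SECONDS_PER_UNIT[units]

print(units_to_seconds("day"))           # 86400
print(seconds_to_units(5400, "hour"))    # 1.5
```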

add_mono_time()

Adds a monotonically increasing time column with units of decimal days

build_col_data()

builds columnwise data matrix with an actual dict

clean(col_header, high_thresh=False, low_thresh=False)

Removes rows where the specified column has an invalid number or is outside the defined thresholds (above high_thresh or below low_thresh)

Parameters:
  • col_header – name of column to clean
  • high_thresh – maximum valid value of data in that column
  • low_thresh – minimum valid value of data in that column
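The filtering rule described for clean can be sketched in plain Python as below. This is an illustration under assumptions, not the library code: clean_rows is a hypothetical helper, it takes a column index rather than a header name, and it uses None instead of False for "no threshold".

```python
import math

def clean_rows(rows, col, high_thresh=None, low_thresh=None):
    """Keep rows whose value in column index `col` is a valid number in range."""
    kept = []
    for row in rows:
        try:
            value = float(row[col])
        except (TypeError, ValueError):
            continue                      # not a number at all
        if math.isnan(value):
            continue                      # parsed, but not a usable number
        if high_thresh is not None and value > high_thresh:
            continue
        if low_thresh is not None and value < low_thresh:
            continue
        kept.append(row)
    return kept

rows = [["0054", "71"], ["0154", "****"], ["0254", "999"], ["0354", "68"]]
cleaned = clean_rows(rows, 1, high_thresh=150, low_thresh=-50)
print(cleaned)
# [['0054', '71'], ['0354', '68']]
```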

column_plot(col_headers, title='', xlabel='', ylabel='', save_path=None)

plots a specific column or column(s) by header name

Accepts a custom title and axis labels. If a save_path is specified, it will save the plot to that path and close it automatically.

Parameters:
  • col_headers – list of columns to plot
  • title – title to place on plot
  • xlabel – label for x axis
  • ylabel – label for y axis
  • save_path – filepath at which to save figure as image.

column_stats(col_header)

takes statistics on a specific column of data

creates object attributes according to the column name. For example, for col_header = “temperature”, the following attributes are created:

self.temperature_max_v    # maximum value
self.temperature_min_v    # minimum value
self.temperature_max_i    # index value where maximum occurs
self.temperature_min_i    # index value where minimum occurs
self.temperature_avg      # average
self.temperature_std      # standard deviation

Parameters:
  • col_header – name of column on which to take statistics
Return statistics:
  a dictionary of the column statistical values
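The statistics listed above can be reproduced for a plain list of numbers as follows. This is a sketch only: the dictionary keys mirror the attribute suffixes, and the use of a population standard deviation is an assumption, since the docstring does not say which form dnppy uses.

```python
from statistics import mean, pstdev

def column_stats(values):
    """Summary statistics for one column of numeric values."""
    return {
        "max_v": max(values),
        "min_v": min(values),
        "max_i": values.index(max(values)),   # first index of the maximum
        "min_i": values.index(min(values)),   # first index of the minimum
        "avg":   mean(values),
        "std":   pstdev(values),
    }

stats = column_stats([68.0, 71.0, 75.0, 70.0])
print(stats["max_v"], stats["max_i"])   # 75.0 2
print(stats["avg"])                     # 71.0
```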

define_time(time_header, fmt, start_date=False)

Converts time strings into time objects for standardized processing. For tips on how to use the ‘fmt’ variable, see [https://docs.python.org/2/library/datetime.html#strftime-strptime-behavior]

The time_header variable can be either the header string or a column index number.

Creates:
  • A converted list of time objects (self.time_dom)
  • A new list of monotonically increasing decimal days (self.time_dec_days)
  • A new list of monotonically increasing second values (self.time_seconds)
Parameters:
  • time_header – name of column with time data in it
  • fmt – the fmt string to interpret time data into datetime objects
  • start_date – The date to count up from.
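The three lists that define_time is described as creating can be pictured with the standard library alone. The sketch below borrows the attribute names as plain variables and assumes the earliest entry serves as the start date; it is not the dnppy implementation.

```python
from datetime import datetime

fmt    = "%Y%m%d%H%M"
stamps = ["201307180154", "201307180054", "201307190054"]

# parse, then sort into ascending time order
time_dom = sorted(datetime.strptime(s, fmt) for s in stamps)
start    = time_dom[0]       # earliest entry used as the start date

time_seconds  = [(t - start).total_seconds() for t in time_dom]
time_dec_days = [s / 86400.0 for s in time_seconds]

print(time_seconds)          # [0.0, 3600.0, 86400.0]
print(time_dec_days[-1])     # 1.0
```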

from_csv(filepath, delim=', ')

Simple reader of a delimited file. To read more complex text data into a time series object, use a custom reader function to return a text_data_class object and feed it into this time series with

time_series_object.from_tdo(text_data_object)

To read a csv straight into a time_series object, the file must have headers.

from_list(data, headers, time_header, fmt)

creates the time series data from a list

Parameters:
  • data – list of lists making up rows and columns of data
  • headers – list of headers (column names)
  • time_header – string of header over the column representing time
  • fmt – the format of data in that time column

from_tdo(tdo)

reads time series data from a dnppy.text_data_class object

Parameters:
  • tdo – a dnppy.text_data_class object containing time data

group_bins(fmt_units, overlap_width=0, cyclical=True)

Sorts the time series into time chunks by common bin_unit

Used for grouping data rows together. For example, if one used this function on a 5 year dataset with a bin_unit of month, then the time_series would be subsetted into 12 sets (1 for each month), with each set containing all entries for that month, regardless of what year they occurred in.

Parameters:
  • fmt_units –
      %Y groups files by year
      %m groups files by month
      %j groups files by julian day of year
  • overlap_width – similarly to “make_subsets”, the “overlap_width” variable can be set to greater than 1 to allow “window” type statistics, so each subset may contain data points from adjacent subsets. However, for group_bins, overlap_width must be an integer.
  • cyclical – “cyclical” of “True” will allow end points to be considered adjacent. So, for example, January will be considered adjacent to December, and day 1 will be considered adjacent to day 365.
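The binning and the cyclical option can be illustrated without dnppy. In the sketch below, entries from different years land in the same month bin, and a hypothetical neighbors helper shows how wrapping makes January adjacent to December. None of this is the library's own code.

```python
from collections import defaultdict
from datetime import datetime

times = [datetime(2011, 1, 15), datetime(2012, 1, 3),
         datetime(2012, 12, 25), datetime(2013, 7, 4)]

bins = defaultdict(list)
for t in times:
    bins[t.strftime("%m")].append(t)    # bin key ignores the year

print(sorted(bins))          # ['01', '07', '12']
print(len(bins["01"]))       # 2

def neighbors(month, overlap_width=1, cyclical=True):
    """Months (1-12) within overlap_width of `month`."""
    out = []
    for offset in range(-overlap_width, overlap_width + 1):
        m = month + offset
        if cyclical:
            m = (m - 1) % 12 + 1        # wrap so 0 -> 12 and 13 -> 1
        if 1 <= m <= 12:
            out.append(m)
    return out

print(neighbors(1))                     # [12, 1, 2]
print(neighbors(1, cyclical=False))     # [1, 2]
```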

interp_col(time_obj, col_header)

For the input column, interpolates values to estimate the value at the input time_obj. The input time_obj may also be a datestring matching the declared fmt.

Parameters:
  • time_obj – A datetime object
  • col_header – The name of the column to interpolate at time (time_obj)
Return interp_y:
  The interpolated value of input column at input time
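As an illustration of interpolating a column at an arbitrary time, here is a minimal linear interpolation between the two samples that bracket the requested time. Linear interpolation is an assumption made for this sketch, since the docstring does not state which scheme interp_col uses, and interp_at is a hypothetical helper.

```python
from datetime import datetime

def interp_at(times, values, when):
    """Linearly interpolate `values` (sampled at sorted `times`) at `when`."""
    for i in range(len(times) - 1):
        t0, t1 = times[i], times[i + 1]
        if t0 <= when <= t1:
            frac = (when - t0).total_seconds() / (t1 - t0).total_seconds()
            return values[i] + frac * (values[i + 1] - values[i])
    raise ValueError("time is outside the range of the series")

times  = [datetime(2013, 7, 21, 0), datetime(2013, 7, 21, 2)]
values = [70.0, 74.0]
print(interp_at(times, values, datetime(2013, 7, 21, 1)))   # 72.0
```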

interrogate()

prints a heads up stats table of all subsets in this time_series

make_subsets(subset_units, overlap_width=0, cust_center_time=False, discard_old=False)

splits the time series into individual time chunks

used for taking advanced statistics, usually periodic in nature. Also useful for exploiting temporal relationships in a dataset for a variety of purposes including data sanitation, periodic curve fitting, etc.

Parameters:
  • subset_units – follows the convention of fmt. For example:
      %Y groups files by year
      %m groups files by month
      %j groups files by julian day of year
  • overlap_width – this variable can be set to greater than 0 to allow “window” type statistics, so each subset may contain data points from adjacent subsets.

      overlap_width = 1 is like a window size of 3
      overlap_width = 2 is like a window size of 5

    WARNING: this function is imperfect for making subsets by months. The possible lengths of a month vary, so sometimes data points at the ends of a month will get placed in the adjacent month. If you absolutely require accurate month subsetting, you should use this function to subset by year, then use the “group_bins” function to bin each year by month. So, instead of

ts.make_subsets("%b")     # subset by month

use

ts.make_subsets("%Y")     # subset by year
ts.group_bins("%b")       # bin by month

  • cust_center_time – Allows a custom center time to be used. This was added so that days could be centered around a specific daily acquisition time. For example, it’s often useful to define a day as satellite data acquisition time +/- 12 hours. If used, “cust_center_time” must be a datetime object.
  • discard_old – By default, performing a subsetting on a time series that already has subsets does not subset the master time series, but instead the lowest level subsets. Setting “discard_old” to “True” will discard all previous subsets of the time series and start subsetting from scratch.

merge_cols(header1, header2)

merges two columns together (string concatenation) into a new column. The new column will be named [header1]_[header2].

Parameters:
  • header1 – the name of the 1st column to merge
  • header2 – the name of the 2nd column to merge
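A typical reason to merge columns is to join separate date and time fields into one string that a single fmt can parse. The sketch below shows the idea on plain lists; concatenating with no separator is an assumption, and this standalone merge_cols function is a hypothetical stand-in rather than the method itself.

```python
headers = ["DATE", "TIME", "TEMP"]
rows    = [["20130721", "0054", "71"],
           ["20130721", "0154", "70"]]

def merge_cols(headers, rows, header1, header2):
    """Append a column holding the string concatenation of two columns."""
    i, j = headers.index(header1), headers.index(header2)
    new_headers = headers + [header1 + "_" + header2]
    new_rows = [row + [str(row[i]) + str(row[j])] for row in rows]
    return new_headers, new_rows

new_headers, new_rows = merge_cols(headers, rows, "DATE", "TIME")
print(new_headers[-1])    # DATE_TIME
print(new_rows[0][-1])    # 201307210054
```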

normalize(col_header)

Used to normalize specific columns in the time series. Normalization will scale all values in the column to be between 0 and 1.
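Scaling values into the range 0 to 1 is commonly done with min-max scaling, sketched below. That normalize uses exactly this formula is an assumption; the docstring only states the resulting range.

```python
def normalize(values):
    """Min-max scale a list of numbers into the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]    # constant column: avoid dividing by zero
    return [(v - lo) / (hi - lo) for v in values]

scaled = normalize([60.0, 70.0, 80.0])
print(scaled)   # [0.0, 0.5, 1.0]
```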

rebuild(destroy_subsets=False)

Reconstructs the time series from its constituent subsets

Parameters:
  • destroy_subsets – Set to True to destroy the existing subsets of the time series, which will allow them to be rebuilt in a different manner.

rename_header(header_name, new_header_name)

renames a header and updates data structures

Parameters:
  • header_name – name of an existing header
  • new_header_name – new name of that header

subset_stats(col_header)

Creates a new time_series object, which is built from column statistics of this time_series’s subsets. For example:

Let’s say we have a year’s worth of hourly temperature data, and we want to get daily summaries of temperature statistics. To do this, the syntax would look like this:

temperature_ts.make_subsets("%d")
daily_sum_ts = temperature_ts.subset_stats("Temp")

This function is not yet finished.

to_csv(csv_path)

Writes the row data of this time_series to a csv file.

Parameters:
  • csv_path – filepath at which to create new csv file.

class rast_series(name='name', units=None, subsetted=False, disc_level=0, parent=None)

This is an extension of the time_series class

It is built to handle just like a time_series object, with the simplification that the “row_data” attribute is composed of nothing more than filepaths and filenames. All attributes for tracking the time domain are the same.

Some time_series methods for manipulating and viewing text data will not apply to rast_series. It is unknown how they will behave.

from_directory(directory, fmt, fmt_unmask=None)

creates a list of all rasters in a directory, then passes this list to self.from_rastlist

see from_rastlist

from_rastlist(filepaths, fmt, fmt_unmask=None)

Loads up a list of filepaths as a time series. If filenames contain variant characters that are not related to date and time, a “fmt_unmask” may be used to isolate the datestrings from the rest of the filenames by character index.

For an example filename "MYD11A1.A2013001_day_clip_W05_C2014001_Avg_K_C_p_GSC.tif", we only want the part with the year and julian day in it, "2013001". So we can use

fmt        = "%Y%j"
fmt_unmask = [10,11,12,13,14,15,16]

This indicates that we only want the 10th through 16th characters of the filenames to be used for anchoring each raster in a physical time.
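When building a character mask it helps to check the positions directly, because counting from one and counting from zero give different lists. The sketch below locates the datestring in the example filename and prints both conventions. Which one from_rastlist expects is not confirmed here; the rast_series basics example above uses zero-based positions.

```python
filename   = "MYD11A1.A2013001_day_clip_W05_C2014001_Avg_K_C_p_GSC.tif"
datestring = "2013001"

start      = filename.find(datestring)      # zero-based start position
zero_based = list(range(start, start + len(datestring)))
one_based  = [i + 1 for i in zero_based]

print(zero_based)   # [9, 10, 11, 12, 13, 14, 15]
print(one_based)    # [10, 11, 12, 13, 14, 15, 16]
```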

group_bins(fmt_units, overlap_width=0, cyclical=True)

Sorts the time series into time chunks by common bin_unit

Used for grouping data rows together. For example, if one used this function on a 5 year dataset with a bin_unit of month, then the time_series would be subsetted into 12 sets (1 for each month), with each set containing all entries for that month, regardless of what year they occurred in.

Parameters:
  • fmt_units –
      %Y groups files by year
      %m groups files by month
      %j groups files by julian day of year
  • overlap_width – similarly to “make_subsets”, the “overlap_width” variable can be set to greater than 1 to allow “window” type statistics, so each subset may contain data points from adjacent subsets. However, for group_bins, overlap_width must be an integer.
  • cyclical – “cyclical” of “True” will allow end points to be considered adjacent. So, for example, January will be considered adjacent to December, and day 1 will be considered adjacent to day 365.

make_subsets(subset_units, overlap_width=0, cust_center_time=False, discard_old=False)

splits the time series into individual time chunks

used for taking advanced statistics, usually periodic in nature. Also useful for exploiting temporal relationships in a dataset for a variety of purposes including data sanitation, periodic curve fitting, etc.

Parameters:
  • subset_units – follows the convention of fmt. For example:
      %Y groups files by year
      %m groups files by month
      %j groups files by julian day of year
  • overlap_width – this variable can be set to greater than 0 to allow “window” type statistics, so each subset may contain data points from adjacent subsets.

      overlap_width = 1 is like a window size of 3
      overlap_width = 2 is like a window size of 5

    WARNING: this function is imperfect for making subsets by months. The possible lengths of a month vary, so sometimes data points at the ends of a month will get placed in the adjacent month. If you absolutely require accurate month subsetting, you should use this function to subset by year, then use the “group_bins” function to bin each year by month. So, instead of

ts.make_subsets("%b")     # subset by month

use

ts.make_subsets("%Y")     # subset by year
ts.group_bins("%b")       # bin by month

  • cust_center_time – Allows a custom center time to be used. This was added so that days could be centered around a specific daily acquisition time. For example, it’s often useful to define a day as satellite data acquisition time +/- 12 hours. If used, “cust_center_time” must be a datetime object.
  • discard_old – By default, performing a subsetting on a time series that already has subsets does not subset the master time series, but instead the lowest level subsets. Setting “discard_old” to “True” will discard all previous subsets of the time series and start subsetting from scratch.

null_set_range(high_thresh=None, low_thresh=None, NoData_Value=None)

Applies the dnppy.raster.null_set_range() function to every raster in rast_series

series_stats(outdir, saves=['AVG', 'NUM', 'STD', 'SUM'], low_thresh=None, high_thresh=None)

Applies the dnppy.raster.many_stats() function to each of the lowest level subsets of this rast_series.