tsa
The tsa (short for time series analysis) module is centered around the time_series class. One or more time_series objects should be central to any data analysis task that examines temporal relationships in data sets of raster or tabular format. This module also houses the rast_series class, which is an extension of time_series for handling filepaths to raster data.
Examples
time_series basics
Users with text data, or sequential raster images with identical spatial extents, may wish to perform time-specific operations on that data set. A time_series object makes it easy to perform tasks like subsetting, plotting, sorting, taking statistics, interpolating between values, sanitizing data for bad entries, and more.
The use case below works with weather data downloaded from this NOAA website. First, take a look at the format of this sample weather data, and take special note of the column labeled "YR--MODAHRMN" and its format.
We want to parse this weather data and perform a variety of manipulations on it, but first we have to get it into Python. To see how this is done, open up your preferred interpreter (IDLE by default) and retype this code step by step.
from dnppy import textio
filepath = "test_data/weather_dat.txt" # define the text filepath
tdo = textio.read_DS3505(filepath) # build a text data object
print(tdo.headers) # print the headers
print(tdo.row_data[0]) # print the first row
We now have a text_data object. To read more about text_data objects, check out the textio module.
To turn this tdo into a time_series, we can do:
from dnppy import tsa # import the tsa module
ts = tsa.time_series('weather_data')
ts.from_tdo(tdo) # use contents of the tdo
print(ts.headers) # print the headers
print(ts.row_data[0]) # print the first row
We can see that similar headers and the same row data can be found in the time_series object ts. Next we need to tell Python how to interpret this data to assign times to each row. dnppy does this with Python datetime objects from the native datetime module. The actual operations are hidden from the user, but we need to use datetime syntax to tell it how our dates are formatted, like so:
timecol = "YR--MODAHRMN" # time data column header
fmt = "%Y%m%d%H%M" # datetime format
ts.define_time(timecol, fmt) # interpret strings
ts.interrogate() # print a heads up summary
As we can see above, we are telling the time_series object ts to execute its internal define_time method with arguments specifying which column contains the time information, and the fmt string to use to interpret it. This specific fmt string is used to read strings such as 201307180054, of format YYYYMMDDHHMM. Read more on datetime formatting syntax at https://docs.python.org/2/library/datetime.html#strftime-strptime-behavior
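The effect of this fmt string can be checked directly with the standard datetime module. The snippet below is only a sketch of the parsing step on one sample value; how dnppy calls datetime internally is not shown here.

```python
from datetime import datetime

fmt = "%Y%m%d%H%M"        # same format string passed to define_time
stamp = "201307180054"    # sample value from the YR--MODAHRMN column

# strptime turns the 12-character string into a datetime object
parsed = datetime.strptime(stamp, fmt)
print(parsed)             # 2013-07-18 00:54:00
```

If the fmt string does not match the layout of the time column, strptime raises a ValueError, so a quick check like this on one value can save some confusion later.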
Now we can do cool stuff, like split it into subsets.
ts.make_subsets("%d") # subset the data into daily chunks
ts.interrogate() # print a heads up summary
As you can see from the report, there are now many “subsets” of this time series. Each of these subsets is actually its own time_series object, and all the same manipulations can be performed on them individually as on the whole series. Let's go ahead and do a quick plot of the temperature data in here.
ts.column_plot("TEMP") # no frills plot of temperature
Notice that we used the column name for temperature data from our weather file. This doesn’t make very pretty plots, so if we want, we can rename parameters. Let's say July 21st is the most interesting day and we want to see temperature and dewpoint next to each other, but we also want our plot to be a little prettier.
ts.rename_header("TEMP","Temperature") # give header better name
ts.rename_header("DEWP","Dewpoint") # give header better name
jul21 = ts["2013-07-21"] # pull out subset july 21st
jul21.column_plot(["Temperature","Dewpoint"], # make plot with labels
title = "Temperature and Dewpoint",
xlabel = "Date and Time",
ylabel = "Degrees F")
This is much better, but now we decide that we don’t just want July 21st, but also the days on either side of it. To do this, we can use an overlap_width in our subsetting command to make more of a moving window through this time series, centered around July 21st.
ts.make_subsets("%d", overlap_width = 1, # subset with 1 overlap
discard_old = True) # discard the old subsets
ts.interrogate() # print a heads up summary
jul21 = ts["2013-07-21"] # pull july 21st subset
jul21.column_plot(["Temperature","Dewpoint"], # save the plot this time
title = "Temperature and Dewpoint",
ylabel = "Degrees F",
save_path = "test.png")
Now we actually have three days in our jul21 time series, and we are happy with this.
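To make the moving-window idea concrete, here is a minimal standard-library sketch of what an overlap_width of 1 means for a daily subset. The window_rows helper and the sample rows are hypothetical and are not part of dnppy; the real make_subsets method may treat subset boundaries differently.

```python
from datetime import datetime, timedelta

def window_rows(rows, center_day, overlap_width):
    """Return rows whose date lies within overlap_width days of center_day."""
    lo = center_day - timedelta(days=overlap_width)
    hi = center_day + timedelta(days=overlap_width)
    return [r for r in rows if lo <= r[0].date() <= hi]

# hypothetical hourly rows of (timestamp, temperature) covering one week
start = datetime(2013, 7, 18)
rows = [(start + timedelta(hours=h), 70.0) for h in range(24 * 7)]

# one day on each side of July 21st gives a three day window
jul21_window = window_rows(rows, datetime(2013, 7, 21).date(), overlap_width=1)
days = sorted({r[0].date().isoformat() for r in jul21_window})
print(days)   # ['2013-07-20', '2013-07-21', '2013-07-22']
```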
So far, this has only introduced you to about a third of the functionality available in the tsa module, but it should be enough to get you started. Consult the internal help docs and function list below to learn more!
rast_series basics
The rast_series class is a child of the time_series class, where each row in rast_series.row_data contains the values [filepath, filename, fmt_name] for a raster image.
There are some cool functions in the raster module that are tailored to time series type raster images, and the rast_series is a type of container to manage your raster data and pass it into more complex raster functions. This example uses data following on from the MODIS extracting and mosaicing example.
A rast_series object can be created and populated with:
from dnppy import tsa
rs = tsa.rast_series()
rastdir = r"C:\Users\jwely\mosaics" # directory of our MODIS mosaics
fmt = "%Y%j"
fmt_mask = range(9,16)
rs.from_directory(rastdir, fmt, fmt_mask)
The rastdir variable in this example is the same place our MODIS mosaics are stored from the MODIS mosaic example. The fmt variable should be familiar to you from the time_series example above, but the fmt_mask is new. The fmt_mask (passed here as the fmt_unmask argument of from_directory) is simply a list of character locations within the filename strings where the information matching fmt can be found. Our filenames look something like MOD10A1.A2015001.mosaic.005.2015006011526_FracSnowCover.tif, and fmt_mask is simply telling us that characters 9 through 15 are where the date information can be found. Remember that Python counts up from zero.
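The character positions can be verified with plain Python before building the rast_series. This sketch only illustrates the indexing on the sample filename above; it does not call dnppy.

```python
from datetime import datetime

filename = "MOD10A1.A2015001.mosaic.005.2015006011526_FracSnowCover.tif"
fmt = "%Y%j"                 # four digit year followed by julian day
fmt_mask = range(9, 16)      # character positions 9 through 15, zero based

# pull out only the masked characters, then parse them with fmt
datestring = "".join(filename[i] for i in fmt_mask)
parsed = datetime.strptime(datestring, fmt)

print(datestring)            # 2015001
print(parsed)                # 2015-01-01 00:00:00
```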
For this example, we want to remove all values that are not telling us about the fractional snow cover on the ground. Since these values are represented by integers 1 through 100, we can set all other values to NoData with
rs.null_set_range(high_thresh = 100, NoData_Value = 101)
This simply calls raster.null_set_range on every image in this rast_series, sets all numbers over 100 to equal 101, and sets 101 as the NoData_Value for these images. Now let's say we want rolling average type statistics, with a window 7 days wide to represent a week of data centered around any given day. We can divide the data into week long chunks (with overlap) in the same way we do with time_series.
rs.make_subsets("%d", overlap_width = 3)
rs.interrogate()
An overlap width of three signifies grabbing three days on each side of the center day. Now we can take the statistics with:
statsdir = r"C:\Users\jwely\statistics" # directory to save our statistics
rs.series_stats(statsdir)
This calls the raster.many_stats function on every subset in this rast_series object. Now we have weekly snapshots for each day in our data record!
Code Help
Auto-documentation for functions and classes within this module is generated below!
- class time_series(name='name', units=None, subsetted=False, disc_level=0, parent=None)
A subsettable time series object
The primary motivation for creating this object was to allow a time series to be subsetted into any number of small chunks but retain the ability to process and interrogate the time series at any level with the exact same external syntax.
A time series object is comprised of a matrix of data, and may contain an object list of subset time_series objects. Potentially unlimited nesting of time series datasets is possible. For example, a year's worth of hourly data may be subsetted into 1-month time series, while each of those is in turn subsetted into days. The highest level time series will still allow operations to be performed upon it.
All internal methods are built to handle this flexible definition of a time series, where the steps of the method depend on whether the time series is at its smallest subset or not.
MEMORY WARNING. The entirety of the dataset is represented in every layer of subsetting, so watch out for exploding memory consumption by excessive subsetting of gigantic datasets.
- _build_time(time_header, fmt, start_date=False)
This internal use function is called twice by “define_time”. Once to turn all the datestamps into datetime objects, then a second time once the entire dataset has been sorted in ascending time order by those datetime objects.
This is to ensure all time values are in terms of the correct start time, which can be no later than the earliest entry in the dataset.
Parameters: - time_header – name of column with time data in it
- fmt – the fmt string to interpret time data into datetime objects
- start_date – The date to count up from.
- static _center_datetime(datetime_obj, units)
Returns a datetime object that is centered on the “unit” of the input datetime object.
When grouping datetimes together, center times are important. This function allows a center time with units equal to the user's input (years, months, days, ...) to be generated from the first datetime of the time series.
Parameters: - datetime_obj – any datetime object
- units – units by which to center input datetime object
Return center_datetime: returns centered datetime object.
- static _fmt_to_units(in_fmt)
Converts fmt strings to unit names of associated datetime attributes.
Parameters: in_fmt – datetime object style unit characters like %Y or %m
Return units: English version of input format. Example: %Y -> “year”
- _get_atts_from(parent_time_series)
Allows bulk setting of attributes. Useful for allowing a subset to inherit information from its parent time_series
Parameters: parent_time_series – a time_series object from which to inherit
- _name_as_subset(binned=False)
Uses the time series object to descriptively name itself. Naming subsets as bins will name them based only on the smallest unit of discretization.
Parameters: binned – set to True to name subsets as bins
- static _seconds_to_units(seconds, units)
Converts seconds to other time units.
Parameters: - seconds – number of seconds
- units – units to convert those seconds to
Returns: time equivalent of input seconds expressed as input units
- static _units_to_fmt(units)
Converts unit names to fmt strings used by datetime.strftime.
Parameters: units – units are strings like “year”, “month”, “hour”
Return fmt: returns the “fmt” equivalent of the English units
- _units_to_seconds(units, dto=None)
Converts other time units to seconds.
Parameters: units – some English unit such as “hour”, “day”, etc.
Return seconds: the number of seconds in an input unit
- clean(col_header, high_thresh=False, low_thresh=False)
Removes rows where the specified column has an invalid number or is outside the defined thresholds (above high_thresh or below low_thresh)
Parameters: - col_header – name of column to clean
- high_thresh – maximum valid value of data in that column
- low_thresh – minimum valid value of data in that column
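The filtering rule described for clean can be sketched with plain lists. The clean_rows helper, its None defaults, and the sample rows below are hypothetical illustrations and are not dnppy's implementation.

```python
import math

def clean_rows(rows, col_index, high_thresh=None, low_thresh=None):
    """Keep rows whose value in column col_index is numeric and within thresholds."""
    kept = []
    for row in rows:
        try:
            value = float(row[col_index])
        except (TypeError, ValueError):
            continue                      # entry is not a valid number
        if math.isnan(value):
            continue                      # NaN counts as invalid here
        if high_thresh is not None and value > high_thresh:
            continue                      # above the maximum valid value
        if low_thresh is not None and value < low_thresh:
            continue                      # below the minimum valid value
        kept.append(row)
    return kept

# made-up rows of [timestamp, temperature] with two bad entries
rows = [
    ["201307180054", "75"],
    ["201307180154", "9999.9"],
    ["201307180254", "****"],
    ["201307180354", "72"],
]
cleaned = clean_rows(rows, 1, high_thresh=150, low_thresh=-50)
print([r[1] for r in cleaned])   # ['75', '72']
```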
- column_plot(col_headers, title='', xlabel='', ylabel='', save_path=None)
Plots a specific column or column(s) by header name.
Accepts custom title input and y-axis label. If a save_path is specified, it will save the plot to that path and close it automatically.
Parameters: - col_headers – list of columns to plot
- title – title to place on plot
- xlabel – label for x axis
- ylabel – label for y axis
- save_path – filepath at which to save figure as image.
- column_stats(col_header)
Takes statistics on a specific column of data and creates object attributes according to the column name. For example, for col_header = “temperature”, the following attributes are created:
self.temperature_max_v # maximum value
self.temperature_min_v # minimum value
self.temperature_max_i # index value where maximum occurs
self.temperature_min_i # index value where minimum occurs
self.temperature_avg # average
self.temperature_std # standard deviation
Parameters: col_header – name of column on which to take statistics
Return statistics: a dictionary of the column statistical values
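As a rough illustration of the values listed above, the following standard-library sketch computes the same six quantities for a list of numbers. The function name, dictionary keys, and the choice of population standard deviation are assumptions; the source does not say which standard deviation dnppy uses.

```python
import statistics

def column_stats_sketch(values):
    """Return max, min, their indices, average, and standard deviation."""
    max_v = max(values)
    min_v = min(values)
    return {
        "max_v": max_v,
        "min_v": min_v,
        "max_i": values.index(max_v),       # index of first maximum
        "min_i": values.index(min_v),       # index of first minimum
        "avg": statistics.mean(values),
        "std": statistics.pstdev(values),   # assumption: population std
    }

temps = [70.0, 75.0, 80.0, 75.0]            # made-up temperature column
stats = column_stats_sketch(temps)
print(stats["max_v"], stats["max_i"], stats["avg"])   # 80.0 2 75.0
```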
- define_time(time_header, fmt, start_date=False)
Converts time strings into time objects for standardized processing. For tips on how to use the ‘fmt’ variable, see https://docs.python.org/2/library/datetime.html#strftime-strptime-behavior
The time header variable can be either the header string or a column index number.
Creates:
- A converted list of time objects (self.time_dom)
- A new list of monotonically increasing decimal days (self.time_dec_days)
- A new list of monotonically increasing second values (self.time_seconds)
Parameters: - time_header – name of column with time data in it
- fmt – the fmt string to interpret time data into datetime objects
- start_date – The date to count up from.
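The relationship between the parsed times and the two elapsed-time lists can be sketched as follows. This assumes the start date is the earliest entry in the data and uses made-up timestamps; the variable names mirror the attribute names above but the code is not dnppy's.

```python
from datetime import datetime

fmt = "%Y%m%d%H%M"
stamps = ["201307180654", "201307180054", "201307191254"]   # unsorted samples

# parse, then sort ascending so elapsed times increase monotonically
times = sorted(datetime.strptime(s, fmt) for s in stamps)
start = times[0]                     # assumption: count up from earliest entry

time_seconds = [(t - start).total_seconds() for t in times]
time_dec_days = [s / 86400.0 for s in time_seconds]

print(time_seconds)    # [0.0, 21600.0, 129600.0]
print(time_dec_days)   # [0.0, 0.25, 1.5]
```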
- from_csv(filepath, delim=', ')
Simple reader of a delimited file. To read more complex text data into a time series object, use a custom reader function to return a text_data_class object and feed it into this time series with
time_series_object.from_tdo(text_data_object)
To read a CSV straight into a time_series object, the file must have headers.
- from_list(data, headers, time_header, fmt)
Creates the time series data from a list.
Parameters: - data – list of lists making up rows and columns of data
- headers – list of headers (column names)
- time_header – string of header over the column representing time
- fmt – the format of data in that time column
- from_tdo(tdo)
Reads time series data from a dnppy.text_data_class object.
Parameters: tdo – a dnppy.text_data_class object containing time data
- group_bins(fmt_units, overlap_width=0, cyclical=True)
Sorts the time series into time chunks by common bin_unit, used for grouping data rows together. For example, if one used this function on a 5 year dataset with a bin_unit of month, then the time_series would be subsetted into 12 sets (1 for each month), with each set containing all entries for that month, regardless of what year they occurred in.
Parameters: - fmt_units –
%Y groups files by year
%m groups files by month
%j groups files by julian day of year
- overlap_width – similarly to “make_subsets”, the “overlap_width” variable can be set to greater than 0 to allow “window” type statistics, so each subset may contain data points from adjacent subsets. However, for group_bins, overlap_width must be an integer.
- cyclical – setting “cyclical” to “True” allows end points to be considered adjacent. So, for example, January will be considered adjacent to December, and day 1 will be considered adjacent to day 365.
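The binning behavior described above, including the cyclical wrap between December and January, can be sketched with the standard library. The group_by_month helper and the sample dates are hypothetical; dnppy's group_bins works on time_series objects and may differ in detail.

```python
from collections import defaultdict
from datetime import datetime

def group_by_month(times, overlap_width=0, cyclical=True):
    """Bin datetimes by month number, ignoring the year."""
    bins = defaultdict(list)
    for t in times:
        # each entry also lands in neighboring bins within overlap_width
        for offset in range(-overlap_width, overlap_width + 1):
            month = t.month + offset
            if cyclical:
                month = (month - 1) % 12 + 1   # wrap 0 -> 12 and 13 -> 1
            if 1 <= month <= 12:
                bins[month].append(t)
    return dict(bins)

times = [datetime(2010, 1, 15), datetime(2011, 1, 20),
         datetime(2010, 12, 5), datetime(2012, 6, 1)]

plain = group_by_month(times)
wrapped = group_by_month(times, overlap_width=1, cyclical=True)
unwrapped = group_by_month(times, overlap_width=1, cyclical=False)

print(len(plain[1]))       # 2, both January entries regardless of year
print(len(wrapped[1]))     # 3, December is adjacent to January
print(len(unwrapped[1]))   # 2, no wrap around the end of the year
```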
- interp_col(time_obj, col_header)
For the input column, interpolates values to estimate the value at the input time_obj. The input time_obj may also be a datestring matching the declared fmt.
Parameters: - time_obj – A datetime object
- col_header – The name of the column to interpolate at time (time_obj)
Return interp_y: The interpolated value of input column at input time
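Interpolating a column at an arbitrary time can be sketched as below. Linear interpolation between the two bracketing rows is an assumption made for illustration; the source does not state which interpolation scheme interp_col uses, and the interp_at helper is hypothetical.

```python
from datetime import datetime

def interp_at(times, values, when):
    """Linearly interpolate values at datetime `when`.

    Assumes `times` is sorted and strictly increasing.
    """
    points = list(zip(times, values))
    for (t0, v0), (t1, v1) in zip(points, points[1:]):
        if t0 <= when <= t1:
            frac = (when - t0).total_seconds() / (t1 - t0).total_seconds()
            return v0 + frac * (v1 - v0)
    raise ValueError("time is outside the range of the series")

# made-up hourly temperature readings
times = [datetime(2013, 7, 18, 0, 54), datetime(2013, 7, 18, 1, 54)]
temps = [70.0, 76.0]

interp_y = interp_at(times, temps, datetime(2013, 7, 18, 1, 24))
print(interp_y)   # 73.0
```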
- make_subsets(subset_units, overlap_width=0, cust_center_time=False, discard_old=False)
Splits the time series into individual time chunks used for taking advanced statistics, usually periodic in nature. Also useful for exploiting temporal relationships in a dataset for a variety of purposes including data sanitation, periodic curve fitting, etc.
Parameters: - subset_units – follows the convention of fmt. For example:
%Y groups files by year
%m groups files by month
%j groups files by julian day of year
- overlap_width – this variable can be set to greater than 0 to allow “window” type statistics, so each subset may contain data points from adjacent subsets.
overlap_width = 1 is like a window size of 3
overlap_width = 2 is like a window size of 5
WARNING: this function is imperfect for making subsets by months. The possible lengths of a month vary, so sometimes data points at the ends of a month will get placed in the adjacent month. If you absolutely require accurate month subsetting, you should use this function to subset by year, then use the “group_bins” function to bin each year by month. So, rather than
ts.make_subsets("%b") # subset by month
use
ts.make_subsets("%Y") # subset by year
ts.group_bins("%b") # bin by month
- cust_center_time – allows a custom center time to be used. This was added so that days could be centered around a specific daily acquisition time. For example, it's often useful to define a day as satellite data acquisition time +/- 12 hours. If used, “cust_center_time” must be a datetime object!
- discard_old – by default, performing a subsetting on a time series that already has subsets does not subset the master time series, but instead the lowest level subsets. Setting “discard_old” to “True” will discard all previous subsets of the time series and start subsetting from scratch.
- merge_cols(header1, header2)
Merges two columns together (string concatenation) into a new column. The new column will be named [header1]_[header2].
Parameters: - header1 – the name of the 1st column to merge
- header2 – the name of the 2nd column to merge
- normalize(col_header)
Used to normalize specific columns in the time series. Normalization will scale all values in the column to be between 0 and 1.
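Scaling a column to the range 0 to 1 is commonly done with min-max scaling, sketched here on a plain list. Whether dnppy uses exactly this formula is an assumption, and the normalize_sketch helper is hypothetical.

```python
def normalize_sketch(values):
    """Min-max scale a list of numbers to the range 0 to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]     # constant column, avoid dividing by zero
    return [(v - lo) / (hi - lo) for v in values]

temps = [50.0, 75.0, 100.0]              # made-up temperature column
scaled = normalize_sketch(temps)
print(scaled)   # [0.0, 0.5, 1.0]
```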
- rebuild(destroy_subsets=False)
Reconstructs the time series from its constituent subsets.
Parameters: destroy_subsets – set to True to destroy the existing subsets of the time series, which will allow them to be rebuilt in a different manner.
- rename_header(header_name, new_header_name)
Renames a header and updates data structures.
Parameters: - header_name – name of an existing header
- new_header_name – new name of that header
- subset_stats(col_header)
Creates a new time_series object, which is built from column statistics of this time_series’s subsets. For example:
Let's say we have a year's worth of hourly temperature data, and we want to get daily summaries of temperature statistics. To do this, the syntax would look like this:
temperature_ts.make_subsets("%d")
daily_sum_ts = temperature_ts.subset_stats("Temp")
This function is not yet finished.
- class rast_series(name='name', units=None, subsetted=False, disc_level=0, parent=None)
This is an extension of the time_series class.
It is built to handle just like a time_series object, with the simplification that the “row_data” attribute is comprised of nothing more than filepaths and filenames. All attributes for tracking the time domain are the same.
Some time_series methods for manipulating and viewing text data will not apply to rast_series. It is unknown how they will behave.
- from_directory(directory, fmt, fmt_unmask=None)
Creates a list of all rasters in a directory, then passes this list to self.from_rastlist.
See from_rastlist.
- from_rastlist(filepaths, fmt, fmt_unmask=None)
Loads up a list of filepaths as a time series. If filenames contain variant characters that are not related to date and time, a “fmt_unmask” may be used to isolate the datestrings from the rest of the filenames by character index.
For an example filename
"MYD11A1.A2013001_day_clip_W05_C2014001_Avg_K_C_p_GSC.tif"
we only want the part with the year and julian day in it:
"2013001"
So we can use
fmt = "%Y%j"
fmt_unmask = [10,11,12,13,14,15,16]
to indicate that we only want the 10th through 16th characters of the filenames to be used for anchoring each raster in a physical time.
- group_bins(fmt_units, overlap_width=0, cyclical=True)
Sorts the time series into time chunks by common bin_unit, used for grouping data rows together. For example, if one used this function on a 5 year dataset with a bin_unit of month, then the time_series would be subsetted into 12 sets (1 for each month), with each set containing all entries for that month, regardless of what year they occurred in.
Parameters: - fmt_units –
%Y groups files by year
%m groups files by month
%j groups files by julian day of year
- overlap_width – similarly to “make_subsets”, the “overlap_width” variable can be set to greater than 0 to allow “window” type statistics, so each subset may contain data points from adjacent subsets. However, for group_bins, overlap_width must be an integer.
- cyclical – setting “cyclical” to “True” allows end points to be considered adjacent. So, for example, January will be considered adjacent to December, and day 1 will be considered adjacent to day 365.
- make_subsets(subset_units, overlap_width=0, cust_center_time=False, discard_old=False)
Splits the time series into individual time chunks used for taking advanced statistics, usually periodic in nature. Also useful for exploiting temporal relationships in a dataset for a variety of purposes including data sanitation, periodic curve fitting, etc.
Parameters: - subset_units – follows the convention of fmt. For example:
%Y groups files by year
%m groups files by month
%j groups files by julian day of year
- overlap_width – this variable can be set to greater than 0 to allow “window” type statistics, so each subset may contain data points from adjacent subsets.
overlap_width = 1 is like a window size of 3
overlap_width = 2 is like a window size of 5
WARNING: this function is imperfect for making subsets by months. The possible lengths of a month vary, so sometimes data points at the ends of a month will get placed in the adjacent month. If you absolutely require accurate month subsetting, you should use this function to subset by year, then use the “group_bins” function to bin each year by month. So, rather than
ts.make_subsets("%b") # subset by month
use
ts.make_subsets("%Y") # subset by year
ts.group_bins("%b") # bin by month
- cust_center_time – allows a custom center time to be used. This was added so that days could be centered around a specific daily acquisition time. For example, it's often useful to define a day as satellite data acquisition time +/- 12 hours. If used, “cust_center_time” must be a datetime object!
- discard_old – by default, performing a subsetting on a time series that already has subsets does not subset the master time series, but instead the lowest level subsets. Setting “discard_old” to “True” will discard all previous subsets of the time series and start subsetting from scratch.