Contents
- Introduction
- Processes and compositions
- Processes
- Compositions
- Composition organization
- File naming convention
- Hierarchical folder structure
- XML coded parameterization
- <userproj>
- Defining the Framework superuser
- <period>
- <process>
- Boolean tags for <overwrite>, <delete>, <acceptmissing>, and <update>
- <parameters>
- <srcpath> and <dstpath>
- <srccomp> and <dstcomp>
- Other tags
Introduction
Karttur’s GeoImagine Framework is built with object oriented classes and methods. The object oriented concept in programming means that all items belong to a class. Items belonging to a certain class have both properties (or attributes) and methods (processes or functions) associated with that class. In the Framework the most important objects are functions (called processes) and spatial data collections (called compositions). The parameterization of processes and compositions is encoded in eXtensible Markup Language (xml) files. This post first summarizes the concepts of processes and compositions and then explains how they are translated into xml structured code.
Processes and compositions
All functionalities of the Framework are encoded in object oriented processes. Most processes operate on spatial data, but not all; processes are also used for building and managing the database and setting up the processes themselves.
Most processes, however, use spatial data as input (source data), as output (destination data), or both. Source data is usually denoted with the abbreviation src and destination data with dst. All data, regardless of type, belong to a composition.
Processes
There are hundreds of different processes defined within the Framework. You can get a list of all available processes from the top menu item sub processes.
Processes can be regarded as high level Geographic Information System (GIS) functions. The more basic processes are in fact nothing more than interfaces to standard GIS functions. Other processes represent sequences of standard GIS functions. And then there are functions, including for modeling and machine learning, that cannot be found in standard GIS software packages. Compared to an ordinary GIS software package (e.g. ArcGIS, SAGA, QGIS or GRASS), Karttur's GeoImagine Framework is much more demanding to learn and operate at the start. But once you understand the Framework and have gained knowledge about the Python packages you use, you can add any spatial process you can think of. Another advantage of the Framework is that you can combine any number of processes and then run them for data over any region, or the entire Earth, in one go.
Compositions
A composition can be a single file, like the map of global countries, or thousands of files, like the red reflectance of tiles from the MODIS sensor. When asking the Framework to run processes that involve data, the composition(s) to use must be stated. The processing will be done for all compositions falling within the defined spatial and temporal domain (this will become clear further down).
All compositions have an id that is composed of two parts, a thematic part and a content part, separated by an underscore:
theme_content
Neither the theme nor the content may contain an underscore. Each composition id is linked to a scale factor (scalefac), an offset add factor (offsetadd), a pre-defined numeric type (celltype, e.g. Byte, Int16 or UInt16), a nodata value (cellnull), a data unit (dataunit), and a scale measure (measure: n[ominal], o[rdinal], i[nterval] or r[atio]).
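The id rule above can be sketched in a few lines of Python. This is only an illustration, not Framework code: the function name split_compid is hypothetical, and the example id is assembled from the folder and band values used in the xml example further down.

```python
def split_compid(compid: str) -> tuple[str, str]:
    """Split a composition id of the form 'theme_content' into (theme, content)."""
    parts = compid.split("_")
    # Exactly one underscore is allowed: the one separating theme from content.
    if len(parts) != 2 or not all(parts):
        raise ValueError(f"expected 'theme_content', got {compid!r}")
    theme, content = parts
    return theme, content


# Illustrative id only; hyphens are fine, additional underscores are not.
print(split_compid("rainfall_trmm-3b43v7-precip"))
```

An id with a second underscore, or with an empty theme or content, raises a ValueError instead of being silently split.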
A composition can contain different products, as long as the scalefac, offsetadd, celltype, cellnull, dataunit and measure are identical. This means that two versions of the same data, for example derived with a slightly different algorithmic definition, can belong to the same composition. Also reflectance bands from different sensors can share the same composition, as can rainfall data from different sources. But normally data from different origins or sensors are also represented as different compositions. All compositions contain the following components or parts:
- source
- product
- folder (or theme)
- band
- prefix
- suffix
The source is the origin of the dataset (e.g. the satellite platform, or the organization or individual behind the data). product is usually a coded product identifier or the producer. folder is a thematic identifier (the “theme” part of the compid). band is a content identifier. prefix is usually identical to band but can be set differently. suffix is a looser part that can be set freely, but usually represents a version identifier.
Compositions as such do not relate to any spatial extent or temporal validity.
Composition organization
The Framework enforces a very strict file naming convention for all file-stored data that is imported or produced. This file naming convention is intertwined with the compositions. The hierarchical folder structure is as strictly defined as the file names.
File naming convention
All data file names are composed of five (5) parts, of which three relate to the composition (above) and the remaining two to location and timestamp. In the filename the parts are separated by underscores (“_”), and the parts are not allowed to contain any underscore themselves:
- band or content identifier (from composition)
- product or producer (from composition)
- location
- timestamp
- suffix (from composition)
All files in the Framework thus have the following general format:
content_product_location_timestamp_suffix.extension
Within each part, a hyphen (“-”) is used for separating different codes or labels. A hyphen in the timestamp part, for example, denotes that the data in the file represent aggregated data for the period between two dates.
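Because underscores are reserved as separators, a file name can be taken apart mechanically. The sketch below is an illustration rather than the Framework's own parser: parse_filename and FileNameParts are hypothetical names, and the example file name is made up from values that appear in the xml examples of this post.

```python
from typing import NamedTuple


class FileNameParts(NamedTuple):
    content: str    # band or content identifier (from composition)
    product: str    # product or producer (from composition)
    location: str
    timestamp: str
    suffix: str     # from composition


def parse_filename(filename: str) -> FileNameParts:
    """Split 'content_product_location_timestamp_suffix.extension' into its parts."""
    # Drop the extension (everything after the last dot), then split on underscores.
    stem = filename.rsplit(".", 1)[0]
    parts = stem.split("_")
    if len(parts) != 5:
        raise ValueError(f"expected 5 underscore-separated parts, got {len(parts)}")
    return FileNameParts(*parts)


# Made-up file name for illustration only.
parts = parse_filename("trmm-3b43v7-precip_3b43_karttur-trmm_199801_v7-f.tif")
print(parts.location, parts.timestamp)
```

Hyphens inside a part are left untouched, so a location like karttur-trmm or a suffix like v7-f survives the split intact.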
Hierarchical folder structure
All the data files are organized in a hierarchical folder structure with the following levels:
- system
- source (from composition)
- division (tiles, region or mosaic)
- folder (from composition)
- location
- timestamp
The two lowest hierarchical levels, location and timestamp, are identical to the location and timestamp in the file name. The thematic folder can be anything like “rainfall”, “elevation”, “landform” or another thematic identifier, and equals the folder in the composition. division can only take one of three values: “tiles”, “region” or “mosaic”. The user cannot choose which of the three to use; that is linked to each process and coded in the database (put differently, the division is a fixed attribute of the process object). The source level identifies the source or origin of the data and is equal to the composition source, for instance “TRMM” for TRMM rainfall data. The top system level is also a fixed attribute of each process, and relates to the different systems for representing data that are included in the Framework (e.g. ancillary, modis or sentinel), plus one system (system) for some default data.
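The six levels can be joined into a relative folder path as in the sketch below. It only mirrors the order of the levels listed above; the function name data_folder is hypothetical, the volume root is left out, and the choice of “region” as division for the example is an assumption, since the division is fixed per process in the database.

```python
from pathlib import PurePosixPath

# The only three values the division level can take.
DIVISIONS = {"tiles", "region", "mosaic"}


def data_folder(system: str, source: str, division: str,
                folder: str, location: str, timestamp: str) -> PurePosixPath:
    """Join the six hierarchical levels, top to bottom, into a relative path."""
    if division not in DIVISIONS:
        raise ValueError(f"division must be one of {sorted(DIVISIONS)}")
    return PurePosixPath(system, source, division, folder, location, timestamp)


# Illustrative values only; the real division is set by the process, not the user.
print(data_folder("ancillary", "trmm", "region", "rainfall", "karttur-trmm", "199801"))
```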
XML coded parameterization
All processes are initiated and parameterized using xml coded instructions. All processes require some basic, common instructions, and then also process specific instructions. Every xml file must contain three child tags directly under the root (the root itself can be called anything):
- <userproj>
- <period>
- <process>
A single xml file can contain any number of <process> tags; they are executed in order of appearance. But the processes in a single xml file must be for a particular location and a particular time period and temporal resolution. The location must be predefined and owned by the user running the processes, and is given in the <userproj> tag. The temporal framework of the process is defined in the <period> tag. Only the first instances of the <userproj> and <period> tags are read; any duplicate tags are ignored.
<?xml version='1.0' encoding='utf-8'?>
<anyNameForRoot>
<userproj userid = 'karttur' projectid = 'karttur' tractid= 'karttur-trmm' siteid = '*' plotid = '*' system = 'ancillary'></userproj>
<period startyear = "1998" endyear = "2017" timestep='M'></period>
<process processid = 'processToExecture' version = '1.3'>
<parameters param1= 'value1'></parameters>
</process>
</anyNameForRoot>
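To make the structure concrete, the sketch below reads such a file with Python's standard xml.etree.ElementTree module. It is not the Framework's own reader: the function name read_instructions is hypothetical, and treating a missing <period> as an empty setting is an assumption.

```python
import xml.etree.ElementTree as ET

XML_TEXT = """<?xml version='1.0' encoding='utf-8'?>
<anyNameForRoot>
<userproj userid = 'karttur' projectid = 'karttur' tractid= 'karttur-trmm' siteid = '*' plotid = '*' system = 'ancillary'></userproj>
<period startyear = "1998" endyear = "2017" timestep='M'></period>
<process processid = 'processToExecture' version = '1.3'>
<parameters param1= 'value1'></parameters>
</process>
</anyNameForRoot>
"""


def read_instructions(xml_bytes: bytes):
    """Return (userproj attributes, period attributes, [(processid, version), ...])."""
    root = ET.fromstring(xml_bytes)  # the root tag name is not checked
    # find() returns the first matching child, so duplicate tags are ignored.
    userproj = root.find("userproj")
    period = root.find("period")
    if userproj is None:
        raise ValueError("missing <userproj> tag")
    # findall() keeps document order, i.e. the order of execution.
    processes = [(p.get("processid"), p.get("version")) for p in root.findall("process")]
    if not processes:
        raise ValueError("at least one <process> tag is required")
    # Assumption: a missing <period> is treated as an empty period setting.
    period_attrs = dict(period.attrib) if period is not None else {}
    return dict(userproj.attrib), period_attrs, processes


userproj, period, processes = read_instructions(XML_TEXT.encode("utf-8"))
print(userproj["tractid"], period["timestep"], processes)
```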
<userproj>
All the processes included in Karttur's GeoImagine Framework must be associated with a user, a project, a region and a system, all given in the <userproj> tag. The Framework spatial definitions are also hierarchical, and the concept and setup of region is introduced in this post.
If you do not alter the default settings when setting up the Framework database (the topic of another post), the superuser <userproj> tag for running system processes (like installing processes) looks like this:
<userproj userid = 'karttur' projectid = 'karttur' tractid= 'karttur' siteid = '*' plotid = '*' system = 'system'></userproj>
Defining the Framework superuser
The Framework default superuser is called karttur. The user karttur is also by default given a project that is also called karttur and that includes a tractid that is also called karttur. karttur is thus the name of a user, a project and a global region (or tract). Thus defined, the user karttur has the rights to perform all processes available within the Framework.
You can change the name of the user, project and tractid for your superuser. That is done when you set up the links to the database in one of the following posts and then run the database installation as described in yet another post. To change the name of the superuser, and the associated project and tract, you need to edit the xml files used for setting up the Framework database.
<period>
The period tag is used for defining the temporal setting of the processes. The tag is set up as its own process called periodicity. Some processes are not associated with any temporal resolution, including the setting up of processes itself. The period tag is thus not strictly needed for all processes.
All the parameters that are expected under the period tag are attributes, and all have a default value. If no attributes are given, the process expects a single static file with no date. If only startyear is given, startmonth and startday are both set to 1 (i.e. to 1 January). If only endyear is given, the endmonth is set to 12 and endday to 31 (i.e. to 31 December).
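The default rules for start and end dates can be sketched as follows. This covers only the defaults stated above and is not the periodicity process itself: the function names are hypothetical, and falling back to the last day of the given month when only endday is omitted is an assumption (it reduces to 31 December when endmonth is also omitted).

```python
import calendar
from datetime import date


def start_date(attrs: dict[str, str]) -> date:
    """startmonth and startday default to 1 when omitted."""
    return date(int(attrs["startyear"]),
                int(attrs.get("startmonth", 1)),
                int(attrs.get("startday", 1)))


def end_date(attrs: dict[str, str]) -> date:
    """endmonth defaults to 12; endday defaults to the last day of that month."""
    year = int(attrs["endyear"])
    month = int(attrs.get("endmonth", 12))
    # Assumption: use the last day of the month instead of a fixed 31,
    # so that e.g. endmonth=6 does not produce an invalid date.
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, int(attrs.get("endday", last_day)))


# Attribute values arrive as strings, as they would from the xml file.
period = {"startyear": "1998", "endyear": "2017", "timestep": "M"}
print(start_date(period), end_date(period))
```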
Apart from setting the correct start and end dates, the temporal interval must also be set. The coding of the intervals largely follows the Python Data Analysis Library, pandas. pandas should be part of your Anaconda installation. If not, installation instructions are available from the pandas home page.
In Karttur's GeoImagine Framework the attribute timestep is used for defining the temporal interval, and loosely corresponds to the offset aliases defined in pandas. The table below summarises the timestep or offset aliases that are used in the Framework:
Karttur | Pandas | string code | date code | Description |
---|---|---|---|---|
D | D | “yyyymmdd” | yyyymmdd | daily |
M | MS | “yyyymm” | yyyymm01 | monthly |
Q | Q | “yyyyq” | yyyymm01 | quarterly |
A | AS | “yyyy” | yyyy0101 | annual |
XD | – | “yyyydoy” | yyyymmdd | day interval |
While the pandas offset alias “M” represents month end frequency, in Karttur’s coding it means the monthly aggregated value. In the file naming convention, monthly data is represented as “YYYYMM”. In the database the datetime object for any “M” is represented by the first day of that month. Strictly speaking this corresponds to the pandas offset alias “MS”. Quarterly (“Q”) and annual (“A”) timesteps are constructed in a similar manner. Karttur does not use the pandas weekly offset alias (“W”). Instead Karttur includes a day-interval timestep that can be set to any interval. “7D” would then correspond to a weekly timestep.
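The string codes in the table can be produced from a date with the standard library alone. The sketch below is an illustration, not the Framework's implementation: string_code is a hypothetical name, and the three-digit, zero-padded day of year is an assumption about how “doy” is written.

```python
from datetime import date


def string_code(d: date, timestep: str) -> str:
    """Return the string code of a date for a Karttur timestep (D, M, Q, A or XD)."""
    if timestep == "D":
        return f"{d.year:04d}{d.month:02d}{d.day:02d}"   # yyyymmdd
    if timestep == "M":
        return f"{d.year:04d}{d.month:02d}"              # yyyymm
    if timestep == "Q":
        quarter = (d.month - 1) // 3 + 1
        return f"{d.year:04d}{quarter}"                  # yyyyq
    if timestep == "A":
        return f"{d.year:04d}"                           # yyyy
    if timestep.endswith("D") and timestep[:-1].isdigit():
        # Day interval such as "7D" or "16D".
        # Assumption: day of year is zero-padded to three digits.
        return f"{d.year:04d}{d.timetuple().tm_yday:03d}"  # yyyydoy
    raise ValueError(f"unknown timestep {timestep!r}")


print(string_code(date(1998, 3, 15), "M"))
```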
In the GeoImagine Framework, data representing statistical or aggregated conditions over a timespan are given derived temporal codings. These derived codes do not have any datetime code, only a string code.
Karttur | string code | Description |
---|---|---|
timespan-D | “yyyymmdd-yyyymmdd” | aggregate daily data |
timespan-M | “yyyymm-yyyymm” | aggregate monthly data |
timespan-Q | “yyyyq-yyyyq” | aggregate quarterly data |
timespan-A | “yyyy-yyyy” | aggregate annual data |
timespan-XD | “yyyydoy-yyyydoy” | aggregate day interval data |
seasonal-D | “yyyy-yyyy@Ddoy” | daily seasonal average |
seasonal-M | “yyyy-yyyy@Mmm” | monthly seasonal average |
seasonal-Q | “yyyy-yyyy@MQq” | quarterly seasonal average |
seasonal-XD | “yyyy-yyyy@XDdoy” | day interval seasonal average |
See the periodicity process page for more details.
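The derived string codes can be told apart by their separators: a hyphen joins the two ends of a timespan, and an “@” introduces the seasonal part. A minimal sketch, assuming the string layouts in the table above (the function name describe_timestamp is hypothetical):

```python
from typing import Optional


def describe_timestamp(code: str) -> tuple[str, str, Optional[str], Optional[str]]:
    """Return (kind, start, end, season) for a timestamp string code."""
    # Everything after "@" is the seasonal part, e.g. "M03".
    span, _, season = code.partition("@")
    # A hyphen separates the start and end of an aggregated period.
    start, hyphen, end = span.partition("-")
    if season:
        kind = "seasonal"
    elif hyphen:
        kind = "timespan"
    else:
        kind = "single"
    return kind, start, end or None, season or None


for code in ("199801", "199801-201712", "1998-2017@M03"):
    print(code, describe_timestamp(code))
```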
<process>
To run a process, the xml file must also include at least one <process> tag.
<process processid = 'exporttobyteancillary' version = '1.3'>
<overwrite>True</overwrite>
<parameters palette= 'precipln'></parameters>
<srcpath volume = "srcvolume" hdrfiletype = 'tif' datfiletype = 'tif'></srcpath>
<dstpath volume = "dstvolume" hdrfiletype = 'tif' datfiletype = 'tif'></dstpath>
<srccomp>
<trmm-3b43v7-precip id = 'layer1' source = "trmm" product = "3b43" folder = "rainfall" band = "trmm-3b43v7-precip" prefix = "rainfall" suffix = "v7-f">
</trmm-3b43v7-precip>
</srccomp>
</process>
Boolean tags for <overwrite>, <delete>, <acceptmissing>, and <update>
All processes have four Boolean (Yes/No or True/False) parameters: overwrite, delete, acceptmissing and update. If not explicitly stated in the xml file, all default to False. The parameter “<overwrite>True</overwrite>” forces the process (exporttobyteancillary in the example above) to overwrite any existing destination (dst) files.
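The defaulting of the four Boolean tags can be sketched as below. It is an illustration only: read_flags is a hypothetical name, and accepting “True” and “Yes” in any letter case as true values is an assumption about the spellings the Framework recognises.

```python
import xml.etree.ElementTree as ET

BOOLEAN_TAGS = ("overwrite", "delete", "acceptmissing", "update")


def read_flags(process: ET.Element) -> dict[str, bool]:
    """Read the four Boolean child tags of a <process>; absent tags become False."""
    flags = {}
    for tag in BOOLEAN_TAGS:
        text = process.findtext(tag)  # None when the tag is absent
        # Assumption: "true" and "yes" (any letter case) count as True.
        flags[tag] = text is not None and text.strip().lower() in ("true", "yes")
    return flags


process = ET.fromstring(
    "<process processid='exporttobyteancillary' version='1.3'>"
    "<overwrite>True</overwrite>"
    "</process>"
)
print(read_flags(process))
```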
<parameters>
The <parameters> tag is a child of the <process> tag, and contains process-specific parameters. You need to check each process to get information on the parameters to set. Some parameters have defaults and need not be stated, whereas other parameters are mandatory. If you forget to set a mandatory parameter, the process will halt and report the missing parameter.
<srcpath> and <dstpath>
If the process requires source (src) data stored in files as input, the tag <srcpath> is required for identifying the volume and the filetype. Similarly, if the process produces data stored in files the tag <dstpath> is required. For most processes the filetype attribute is not required. By default all spatial raster data is represented as GeoTiff files (.tif), and all spatial vector data as ESRI shape files (.shp).
<srccomp> and <dstcomp>
<srccomp> must be given if the process expects source data on file. The source data must state the composition parts as attributes:
- source
- product
- folder
- band
- prefix
- suffix
Not all processes that produce destination data on file require that <dstcomp> is given. Some processes include hardcoded definitions of the destination composition, the file name and the hierarchical folder structure. Time series analyses, for example, typically produce destination compositions with the folder hierarchy and the file name derived from the source composition, with no possibility for user changes. Other processes do allow some restricted user changes, whereas the majority of processes require that the user gives the full definition of the destination composition(s). Apart from the components listed above, destination compositions must also include attributes for:
- scalefac
- offsetadd
- celltype
- cellnull
- dataunit
- measure
If the Framework detects any conflicts with existing compositions, the process will not proceed until this is corrected.
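Before running a process it can be handy to check that a composition tag carries all the attributes listed above. The following is a sketch of such a check, not something the Framework provides: missing_attributes is a hypothetical helper, and it only tests for presence, not for conflicts with compositions already in the database.

```python
# Composition parts required for both source and destination compositions.
COMPOSITION_PARTS = ("source", "product", "folder", "band", "prefix", "suffix")
# Additional attributes required for destination compositions.
DESTINATION_EXTRAS = ("scalefac", "offsetadd", "celltype", "cellnull", "dataunit", "measure")


def missing_attributes(attrs: dict[str, str], destination: bool = False) -> list[str]:
    """List the required attribute names that are absent from a composition tag."""
    required = COMPOSITION_PARTS + (DESTINATION_EXTRAS if destination else ())
    return [name for name in required if name not in attrs]


# Attributes of the source composition in the <process> example above.
src = {"id": "layer1", "source": "trmm", "product": "3b43", "folder": "rainfall",
       "band": "trmm-3b43v7-precip", "prefix": "rainfall", "suffix": "v7-f"}
print(missing_attributes(src))
print(missing_attributes(src, destination=True))
```

Used on the source composition from the example, the first call reports nothing missing, while the second lists the six extra attributes a destination composition would need.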
Other tags
A few other child tags under <process> are used in specific processes, including:
- <node> (in e.g. addsubproc for defining process parameters)
- <stats> (in e.g. trendtcancillary for defining statistical measure to produce)
- <comps> (in e.g. createscaling for defining scaling when exporting data)
The pages for each individual process list all the tags and attributes required.