Postgres setup with Python & xml

Contents - Introduction - Prerequisites - Python package for setting up schemas and tables - Main module - xml files defining the schemas and tables - Class module (db_setup_class.py) - Functions in db_setup_class.py - ReadSqlXml - BoolTag - CreateSchema - GetTableKeys - CreateTable - TableInsert - TableUpdate - general_processes - Using the setup-db-karttur package

Introduction

This post describes the automatic setup of schemas and tables for a PostgreSQL (or Postgres for short) database (db) from Python. Since it was published (in January 2018) it has become redundant if you clone or download Karttur´s GeoImagine Framework from the GitHub.com.

However, the Python processes described are generic, and available as a Python package (db_setup) on GitHub.com. Schemas and tables are defined in xml (eXtensible Markup Language) files, and by editing the xml contents, any schema and table architecture can be setup. The included xml files are for setting up Karttur’s Geo Imagine processing framework.

The intention of this blog is that you can either use it for setting up your own architecture of schemas and tables, or just use it for setting up Karttur’s Geo Imagine processing framework. In the former case, you must create your own xml files defining the architecture, in the latter case you can use the xml files on GitHub.com as they are.

You can also choose to either follow the instructions to understand how the Python package db_setup works, or download the package from GitHub.com. The package is, however, custom built for the Python and PostgreSQL setup as stated in the next section (Prerequisites).

Prerequisites

To follow this post, and to run the Python package db_setup, you must have Python and Postgres server running on your machine. You must have a .netrc file in your home directory with your Postgres login credentials, and your Postgres server must be in the port 5432 with a database cluster called ‘postgres’. The instructions for installations and setup of the framework used in this post include:

Anaconda (Python Interpreter)
Eclipse (Integrated Development Environment - IDE)
PostgreSQL (database server)
psycopg2 (connection between Python and Postgres)

Python package for setting up schemas and tables

If you want to follow this step by step instruction, start Eclipse.

Main module (db_setup_main.py)

If you followed my earlier post and created the package db_setup_karttur, you must now add a new function SetupSchemasTables to the package main module db_setup_main.py, and then call it from __main__:

def SetupSchemasTables(projFPN,db,verbose):
    '''
    Setup schemas and tables
    '''
    #Read the textfile with links to the xml files defining schemas and tables
    dirPath = path.split(projFPN)[0]
    with open(projFPN) as f:
        xmlL = f.readlines()
    # Remove whitespace characters like `\n` at the end of each line
    xmlL = [path.join(dirPath,x.strip())  for x in xmlL if len(x) > 10 and x[0] != '#']
    # Define the database connection
    HOST = 'localhost1'
    secrets = netrc.netrc()
    username, account, password = secrets.authenticators( HOST )
    #encrypt the password
    pswd = b64encode(password)
    #create a query dictionary for connecting to the Postgres server
    query = {'db':db,'user':username,'pswd':pswd}
    #Connect to the Postgres Server
    session = PGsession(query)
    #Loop over all xml files and create Schemas and Tables
    for xml in xmlL:
        session.ReadSqlXml(xml,verbose)

if __name__ == '__main__':
    '''
    This module should only be run at the very startup of building the Karttur Geo Imagine framework.
    To run, remove the comment "#prodDB" and set the name of your production DB ("YourProdDB")
    '''
    #prodDB = 'YourProdDB'

    '''SetUpProdDb creates an empty Postgres database under the cluster postgres with the name "YourProdDB".'''
    SetUpProdDb(prodDB)
    '''
    SetupSchemasTables creates schemas and tables from xml files, with the relative path to the
    xml files given in the plain text file "projFPN".
    '''
    verbose = True
    projFPN = 'doc/db_karttur_setup_20180106_0.txt'
    SetupSchemasTables(projFPN,prodDB,verbose)

The function SetupSchemasTables expects 3 variables:

projFPN - the path to a textfile that lists the xml files defining the schemas and tables to create
db - the name of the database in the postgres cluster for which to create the schemas and tables
verbose - a boolean (True/False) variable defining if progress is to be shown at runtime or not

xml files defining the schemas and tables

In the GitHub.com repository the ‘projFPN’ is in a folder doc under the setup_db package itself. You can also create that folder from Eclipse. Make sure that the package db_setup_karttur is selected, and go via the menu:

File : New : Folder

The text file linking to the xml files that defines the schemas and tables to create should be in the sub-folder doc. In the GitHub repository the text file is called db_karttur_setup_YYYYMMDD_0.txt, where YYYYMMDD is the date of the version (‘20180106’ when I write this post). The content of db_karttur_setup_YYYYMMDD_0.txt consists of comments and (relative) file paths:

# general_schema_vXX_sql.xml installs the default database schemas
xmlsql/general_schema_v80_sql.xml

# general_processes_vXX_sql.xml installs the tables for handling paths and processes and
# the core process handling all other process definitions,
xmlsql/general_processes_v80_sql.xml

The SetupSchemasTables function ignores rows with no text and rows starting with ‘#’. If you look at the paths to the xml files, they all start with xmlsql/. The function SetupSchemasTables interpretes this as a relative path, and thus expects the xml files to be in a sub-folder under the doc folder. If you cloned or downloaded the setup_db repo on GitHub the complete file and folder structure are already in place. If you want to copy and paste into your own structure, create a sub-folder xmlsql under the folder doc.

The first xml file linked in db_karttur_setup_YYYYMMDD_0.txt is general_schema_v80_sql.xml (link to file on GitHub.com), and defines the schema to be created. It must be the first xml file in the list.

<?xml version='1.0' encoding='utf-8'?>
<sql>
	<!-- SQL xml read by geomimagine subpackage db_setup -->
	<process processid ='createschema'>
		<overwrite>N</overwrite>
		<delete>N</delete>
		<parameters db = 'karttur' schema = 'process'></parameters>
	</process>

  <process processid ='createschema'>
  ...
  </process>
  ...

  ...
</sql>

The key node for all Geo Imagine processes is <process>; the childtags <overwrite> and <delete> are universal for all processes, The childnode <parameters> contains process specific parameters. The xml for the <process> ‘createschema’ is simpler compared to almost all other processes, but more about that later.

Class module (db_setup_class.py)

The final loop in the main module

    for xml in xmlL:
        session.ReadSqlXml(xml)

points to the function ReadSqlXml. The version of db_setup_class.py that was created in the previous post does not contain that function, nor any of the other functions that are needed for creating schemas and tables.

You can download the complete version of db_setup_class.py from GitHub, or you can construct it by following the instructions below. The latter will help you understand a bit more about how Python function, but the end result will be the same as the module available on GitHub.

Functions in db_setup_class.py

The module db_setup_class.py can perform a range of different database tasks:

Create schema
Create table
Delete table
Replace table
Insert table records
Delete table records
Replace table records
Update table records

To achieve this, db_setup_class.py contains the object oriented Class PGSession. When calling the Class PGSession, the __init__ function expects a dictionary with your login credentials and the name of the db to work with . PGSession responds by returning an instance of itself (__self__). When you (later) call the PGSession, the functions defined in PGSession are available to the returned instance of the Class. In the main module this is utilized in the final loop, that calls the PGSession function ReadSqlXml and when doing so sends a single variable ‘xmlfile’ (the name of an .xml file). The function ReadSqlXml then reads and parses (disentangle) the xml code, and subsquently calls other functions for managing the Postgres db as defined in the xml code. As PGSessions was initiated with the login credentials for your user, the Postgres db connection is available to the returned instance also in each function defined under the PGSession Class.

The initiation (__init__) is the same as in the post where you created PGSession.

class PGsession:
    """Connects the postgres database for processes."""   
    def __init__(self, query):
        query['pswd'] = b64decode(query['pswd'])
        conn_string = "host='localhost' dbname='%(db)s' user='%(user)s' password='%(pswd)s'" %query
        self.conn = psycopg2.connect(conn_string)
        self.cursor = self.conn.cursor()

But the complete PGSession Class in the module db_setup_class.py also contains these additional functions.

ReadSqlXml
_BoolTag
_CreateSchema
_GetTableKeys
_CreateTable
_TableInsert
_TableUpdate

All the functions, except ReadSqlXml starts with underscore (_). This indicates that the function is private and can only be accessed from within the Class PGSession itself, they can not be called from any other function.

ReadSqlXml

ReadSqlXml reads and parses the XML file, and then calls the other (private) class functions dependent on the process(es) defined in the xml file it operates on.


  
def ReadSqlXml(self,xmlFPN,verbose):
        ‘’’ Read XML file ‘’’
        self.verbose = verbose
        dom = minidom.parse(xmlFPN)

        if self.verbose:
            print ‘    xmlfile’, path.split(xmlFPN)[1]   

        processes = dom.getElementsByTagName(‘process’)

    for process in processes:
        overwriteTag = process.getElementsByTagName('overwrite')
        if len(overwriteTag) == 0:
            self.overwrite = False
        else:
            self.overwrite = self._BoolTag(overwriteTag[0].firstChild.nodeValue )
        deleteTag = process.getElementsByTagName('delete')
        if len(deleteTag) == 0:
            self.delete = False
        else:
            self.delete = self._BoolTag(deleteTag[0].firstChild.nodeValue )
        processid = process.getAttribute('processid')
        #Get the parameters
        paramtag = process.getElementsByTagName('parameters')
        schema = paramtag[0].getAttribute('schema')

        if processid in ['createtable','tableinsert','tableupdate']:
            table = paramtag[0].getAttribute('table')
            if self.verbose:
                print '        %s %s %s' %(processid, schema, table)
            command = process.getElementsByTagName('command')
            if not process in ['tableinsert','createschema']:
                cmd = command[0].firstChild.nodeValue
            if processid in ['tableinsert','tableupdate']:
                valueL = []
                columns = command[0].getElementsByTagName('columns')
                columnL = columns[0].firstChild.nodeValue.split(',')
                values = command[0].getElementsByTagName('values')
                for value in values:
                    valueL.append(value.firstChild.nodeValue.split(','))   
        if processid == 'createschema':
            self._CreateSchema(schema)   
        elif processid == 'createtable':
            self._CreateTable(schema,table,cmd)
        elif processid == 'tableinsert':
            self._TableInsert(schema,table,columnL,valueL)
        elif processid == 'tableupdate':
            self._TableUpdate(schema,table,columnL,valueL,cmd)
        else:
            exitstr =  'Initial command not found in initiate.initialize', processid
            exit(exitstr)

_BoolTag

_BoolTag converts the tags for <overwrite> and <delete> (and other parameters defining boolean variable) to either True or False.

_CreateSchema

_CreateSchema creates the schema defined in the xml file. For security reasons I did not include the capacity to drop a schema in then module. If you want to drop a schema you have to do that manually.

_GetTableKeys

_GetTableKeys is a support function that returns the primary keys (constraints) for a table - that is the keys that in combination must constitute a unique value.

_CreateTable

_CreateTable creates a new table if <overwrite> and <delete> are False (default), deletes a table if <delete> is set to True, and deletes and then recreates a table if <overwrite> is set to True. Both <delete> and <overwrite> are universal parameters for all processes in the Geo Imagine framework. If they are omitted from the xml they are by default set to False.

_TableInsert

_TableInsert inserts, deletes or overwrites data to a table using the same logic as in the function _CreateTable.

_TableUpdate

_TableUpdate updates existing records in a table.

_Close

def _Close(self): self.cursor.close() self.conn.close()

General processes

The excerpt of general_processes_v80_sql.xml shows the general structure of the xml files that are used for creating tables. Incidentally it is also the xml file that defines the process for later adding all other processes if you intend to setup Karttur’s Geo Imagine processing framework.

<process processid ='createtable'>
		<parameters db = 'karttur' schema = 'process' table = 'rootprocesses'></parameters>
		<overwrite>N</overwrite>
		<delete>N</delete>
		<command>
			rootprocid TEXT,
			roottitle varchar(128),
			rootlabel varchar(255),
			creator varchar(32),
			PRIMARY KEY (rootprocid)
		</command>
	</process>

	<process processid ='createtable'>
		<parameters db = 'karttur' schema = 'process' table = 'subprocesses'></parameters>
		<overwrite>N</overwrite>
		<delete>N</delete>
		<command>
			rootprocid TEXT,
			subprocid TEXT,
			version varchar(16),
			minuserstratum integer,
			subtitle varchar(128),
			sublabel varchar(255),
			creator varchar(32),
			access varchar(8),
			createdate date,
			PRIMARY KEY (subprocid,version)
		</command>
	</process>

Both <process> nodes in the excerpt above contains the same tags and attributes. Only the attribute table (in the <parameters> tag) and the tag <command> text differ. The text in the <command> tag is an SQL statement, except that the SQL must also include the schema and table for which to execute the command. When converted to SQL by PGsession the SQL sent to Postgres for the top <process> tag will become:

CREATE TABLE process.rootprocesses (
  rootprocid TEXT,
  roottitle varchar(128),
  rootlabel varchar(255),
  creator varchar(32),
  PRIMARY KEY (rootprocid)
);

But only after PGsession made sure that table did not already exist. Further down in the xml file general_processes_v80_sql.xml the ‘rootprocess’ and ‘subprocess’ required to create all other processes is inserted.

<process processid ='tableinsert'>
		<parameters db = 'karttur' schema = 'process' table = 'rootprocesses'></parameters>
		<command>
			<columns>rootprocid,  roottitle,  rootlabel,  creator</columns>
			<values>'manageprocess', 'Manage database defined process', 'Mangaging a processes requires data on all parameters and their type and default values','thomasg'</values>
		</command>
	</process>

Using the db_setup package

The package setup-db is rather a sub-package to Karttur’s Geo Imagine processing framework, but it can be used for setting up any database schema and table architecture. I thus put it in a separate Github repository and made it publicly availabel. If you want to use the package for setting up your own db architecture you need to define your schemas and tables in your own xml files, and link them via a text file. If you want to setup Karttur’s Geo Imagine processing framework, all you have to do is run the package.