Database

The Outflow database

Outflow maintains a number of tables for internal bookkeeping that can also be useful to users. Among other things, Outflow records each pipeline execution (called a run) and every task inside it, the task states (pending, success, failed or skipped), the exception if one occurred, and the configuration file used for a given run.

You can query these tables using the usual SQLAlchemy syntax:

from outflow.management.models.configuration import Configuration
from outflow.core.pipeline import context
context.session.query(Configuration).all()

This can be useful, for example, to check whether a given input file has already been processed with the current configuration.

You can also check the state of the different runs with a friendlier interface: the Outflow dashboard.

Connect to a PostgreSQL database

For pipeline developers

You will need to create a database and two roles (PostgreSQL lingo for a user): an admin role that owns the database and has all rights on both the layout (the table structure) and the data, and a user role that only has rights on the data. This lets you apply migrations and modify the database layout with the admin role, while your pipeline users access and fill the database with data. They will not be able to accidentally drop a table and all its data, for example by wrongly applying a downgrade migration.

If you manage your own PostgreSQL instance (a local installation, for example), here are example commands to create the roles and a database. If you can, ask your IT department to do this for you on a production PostgreSQL instance and provide you with the credentials.

CREATE USER "pipeuser" PASSWORD 'userpwd' NOSUPERUSER NOCREATEDB NOCREATEROLE INHERIT;
CREATE USER "pipeadmin" PASSWORD 'adminpwd' NOSUPERUSER CREATEDB NOCREATEROLE INHERIT;
CREATE DATABASE outflow_tuto WITH OWNER pipeadmin;

For everyone

To connect to a PostgreSQL database, edit your config.yml file and replace the sqlite section with this one:

#  default:
#    dialect: sqlite
#    path: backend.db

databases:
  default:
    dialect: postgresql
    address: address:port
#    admin: admin_username:admin_password
    user: user_username:user_password
    database: database_name

Replace the address and database fields with the values appropriate for your setup.

If you are a pipeline developer, you need to fill the admin field to apply the migrations. If you are a pipeline user, you only need the user field.

It is also possible to use the same credentials in both the admin and user fields; in that case this PostgreSQL account should have admin rights on the database.
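For illustration, here is how the address, user and database values from the configuration map onto a standard PostgreSQL connection URL. Outflow assembles this internally; the helper below is only a sketch, not part of Outflow's API:

```python
# Sketch: assembling the "user:password" and "address:port" strings from
# config.yml into a PostgreSQL connection URL. Outflow does this for you;
# build_url() is an illustrative helper, not part of its API.

def build_url(user: str, address: str, database: str) -> str:
    username, password = user.split(":", 1)
    host, port = address.split(":", 1)
    return f"postgresql://{username}:{password}@{host}:{port}/{database}"

print(build_url("pipeuser:userpwd", "localhost:5432", "outflow_tuto"))
# → postgresql://pipeuser:userpwd@localhost:5432/outflow_tuto
```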

Using yaml includes to define database credentials outside the config.yml

If you are not comfortable putting your database credentials inside the config.yml file (for example because you want to commit config.yml to a VCS), you can use the include feature described in the [Configuration](Outflow configuration) page of the documentation. Your configuration files may then look like this:

# Inside the config.yml
databases: !include databases.yml


# Inside the databases.yml
default:
  dialect: postgresql
  address: address:port
  # admin: admin_username:admin_password
  user: user_username:user_password
  database: database_name

Set up connections to multiple databases

Outflow can easily connect to multiple databases. To achieve that, it preconfigures SQLAlchemy to support multiple “binds”.

What are binds? In SQLAlchemy speak, a bind is something that can execute SQL statements and is usually a connection or engine. In Outflow, binds are always engines that are created for you automatically behind the scenes. Each of these engines is associated with a short key (the bind key), which is then used at model declaration time to associate a model with a specific engine.

If no bind key is specified for a model, the default connection is used instead (as configured in the pipeline configuration file).


NOTE

The table names must be unique across all binds.


Example Configuration

The following configuration declares two database connections, the default one and another one:

databases:

  default:
    dialect: postgresql
    address: localhost:5432
    admin: admin:pwd
    user: user:pwd
    database: outflow

  another_db:
    dialect: postgresql
    address: localhost:5432
    admin: admin:pwd
    user: user:pwd
    database: another_db

Referring to Binds

In your model file, you specify the bind to use with the __bind_key__ attribute:

from sqlalchemy import Column, Integer, String
# Model is the base class provided by Outflow

class User(Model):
    __bind_key__ = 'another_db'
    id = Column(Integer, primary_key=True)
    username = Column(String(80), unique=True)

class TableWithoutBind(Model):
    id = Column(Integer, primary_key=True)
    some_field = Column(Integer)

Then to query these tables there is nothing special to do: the SQLAlchemy session in the pipeline context has access to all the tables:

context.session.query(User.username).all()
context.session.query(TableWithoutBind.some_field).all()