Understanding data models

Simply put, an Invenio data model defines a record type. You can also think of a data model as a supercharged database table.

In addition to storing records (aka documents or rows in a database) according to a specific structure, a data model also deals with:

Access to records via REST APIs and landing pages.
Internal storage, representation and retrieval of records and persistent identifiers.
Mapping external representations to/from the internal representation via loaders and serializers.

You can build data models that are customized to your exact needs, or data models that follow standard metadata formats such as Dublin Core, DataCite or MARC21. In fact, a data model does not put any restrictions on what you can store, except that a record must be stored internally as JSON.

You can build data models for classic digital repository use cases such as bibliographic and author records, but Invenio is in no way limited to these classic use cases, and you could as well build your geographical research database on top of Invenio.

First steps

First, make sure you have followed the Quickstart, so that you have scaffolded an initial Invenio instance and a data model package.

You should see a directory structure similar to the one below in the newly scaffolded data model package:

|-- ...
|-- docs
|   |-- ...
|-- my_site
|   |-- config.py
|   |-- records
|   |   |-- jsonschemas/
|   |   |-- loaders/
|   |   |-- mappings/
|   |   |-- marshmallow/
|   |   |-- serializers/
|   |   |-- static/
|   |   |-- templates/
|   |   `-- ...
|   `-- ...
|-- setup.py
`-- tests
    |-- ...

Steps

Building a data model involves the following tasks:

Internal representation
- Define a JSONSchema – used to validate the internal structure of your record.
- Define an Elasticsearch mapping – used to specify how your records are indexed by the search engine.
External representation
- Define serializers – transform an internal representation to an external (e.g. JSON to DataCite XML).
- Define loaders – transform and validate an external representation to an internal (e.g. DataCite XML to JSON).
- Define a Marshmallow schema – used to build loaders and serializers.
Exposing records via the UI and REST API
- Define templates – used to render search results and landing pages.
- Configure the UI – enables HTML landing pages for your records.
- Configure the REST API – enables the REST API for your records.

Define a JSONSchema

Internally records are stored as JSON, and in order to validate the structure of the stored JSON you must write a JSONSchema.

The scaffolded data model package includes an example of a simple JSONSchema that you can use to get a feel for what a JSONSchema looks like.

|-- my_site
|   |-- records
|   |   |-- jsonschemas
|   |   |   |-- __init__.py
|   |   |   `-- records
|   |   |       `-- record-v1.0.0.json

In record-v1.0.0.json you should see something like:

{
    "$schema": "http://json-schema.org/draft-04/schema#",
    "id": "https://localhost/schemas/records/record-v1.0.0.json",
    "type": "object",
    "properties": {
        "title": {
            "description": "Record title.",
            "type": "string"
        }
    }
}

Example record

An example record that validates against this schema could look like:

{
    "$schema": "https://localhost/schemas/records/record-v1.0.0.json",
    "title": "My record"
}

Note that the $schema key points to the JSONSchema that the record should be validated against.
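To make the relationship between the schema and the record concrete, here is a minimal, library-free sketch of a structural check. It only understands the type keyword for top-level properties, the helper name check_types is hypothetical, and a real instance relies on a full JSONSchema validator rather than anything like this.

```python
import json

# The example schema from above, parsed into a Python dict.
SCHEMA = json.loads("""
{
    "$schema": "http://json-schema.org/draft-04/schema#",
    "id": "https://localhost/schemas/records/record-v1.0.0.json",
    "type": "object",
    "properties": {
        "title": {"description": "Record title.", "type": "string"}
    }
}
""")

# Only the JSON types needed for this illustration.
JSON_TYPES = {"string": str, "object": dict}


def check_types(record, schema):
    """Return a list of error messages (empty if the record passes)."""
    if not isinstance(record, dict):
        return ["record must be a JSON object"]
    errors = []
    for name, spec in schema.get("properties", {}).items():
        expected = JSON_TYPES.get(spec.get("type"))
        if name in record and expected and not isinstance(record[name], expected):
            errors.append(f"{name}: expected {spec['type']}")
    return errors


record = {"$schema": SCHEMA["id"], "title": "My record"}
print(check_types(record, SCHEMA))         # []
print(check_types({"title": 42}, SCHEMA))  # ['title: expected string']
```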

Discovery of schemas

Invenio uses standard Python entry points to discover your data model package’s JSONSchemas. Thus, in setup.py you’ll see an entry point group named invenio_jsonschemas.schemas:

setup(
    # ...
    entry_points={
        'invenio_jsonschemas.schemas': [
            'my_datamodel = my_datamodel.jsonschemas'
        ],
        # ...
    },
)

Note

A typical mistake is to forget to add a blank __init__.py file inside the jsonschemas folder, in which case the entry point won’t work.
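If you are unsure whether the entry point was registered, you can inspect the installed entry points with the Python standard library. The sketch below is a generic debugging aid, not part of Invenio; the helper name list_entry_points is hypothetical, and the result is simply empty if no installed package declares the group.

```python
from importlib.metadata import entry_points


def list_entry_points(group):
    """Return {name: import path} for all installed entry points in a group."""
    eps = entry_points()
    if hasattr(eps, "select"):
        # Python 3.10 and later expose a selection API.
        selected = eps.select(group=group)
    else:
        # Python 3.8/3.9 return a plain dict mapping group names to lists.
        selected = eps.get(group, [])
    return {ep.name: ep.value for ep in selected}


print(list_entry_points("invenio_jsonschemas.schemas"))
```

Remember that entry points are only picked up after the package has been (re)installed, for example in development mode.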

Define an Elasticsearch mapping

In order to make records searchable, the records need to be indexed in Elasticsearch. Just as the JSONSchema allows you to validate the structure of the JSON, an Elasticsearch mapping tells Elasticsearch how to index your document.

The scaffolded data model package includes an example of a simple Elasticsearch mapping:

|-- my_site
|   |-- records
|   |   |-- mappings
|   |   |   |-- __init__.py
|   |   |   |-- v6
|   |   |   |   |-- __init__.py
|   |   |   |   `-- records
|   |   |   |       `-- record-v1.0.0.json
|   |   |   `-- v7
|   |   |       |-- __init__.py
|   |   |       `-- records
|   |   |           `-- record-v1.0.0.json

Note that you need one Elasticsearch mapping per major version of Elasticsearch you want to support.

In record-v1.0.0.json (for Elasticsearch 7) you should see something like:

{
    "mappings": {
        "date_detection": false,
        "numeric_detection": false,
        "properties": {
            "$schema": {
                "type": "text",
                "index": false
            },
            "title": {
                "type": "text"
            },
            "keywords": {
                "type": "keyword"
            }
        }
    }
}

The above Elasticsearch mapping, similarly to the JSONSchema, defines the structure of the JSON, but also how it should be indexed.

For instance, in the above example the title field is of type text, which means its content is analyzed for full-text search (tokenized and, depending on the analyzer, stemmed). The keywords field is of type keyword, which means no analysis is applied and the field is searched by exact match. The mapping also allows you to define, for example, that a lat and a lon field are in fact geographical coordinates, which enables geospatial queries over your records.
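As a sketch of the geographical case, the mapping below adds a field of the Elasticsearch type geo_point, written as a Python dict so it can be dumped to JSON. The field name location and the sample record are assumptions made for illustration; they are not part of the scaffolded package.

```python
import json

mapping = {
    "mappings": {
        "date_detection": False,
        "numeric_detection": False,
        "properties": {
            "title": {"type": "text"},
            "keywords": {"type": "keyword"},
            # Hypothetical field: a geo_point holds a latitude/longitude pair.
            "location": {"type": "geo_point"},
        },
    }
}

# A record matching the mapping; geo_point accepts an object with lat and lon.
record = {
    "title": "Weather station",
    "keywords": ["climate"],
    "location": {"lat": 46.23, "lon": 6.05},
}

print(json.dumps(mapping, indent=4))
```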

Naming JSONSchemas and mappings

You may already have noticed that both JSONSchemas and Elasticsearch mappings are using the same folder structure and naming scheme:

|-- my_site
|   |-- records
|   |   |-- jsonschemas
|   |   |   |-- __init__.py
|   |   |   `-- records
|   |   |       `-- record-v1.0.0.json
|   |   |-- mappings
|   |   |   |-- __init__.py
|   |   |   `-- v7
|   |   |       |-- __init__.py
|   |   |       `-- records
|   |   |           `-- record-v1.0.0.json

The naming scheme is very important for three reasons:

Indexing of records
Data model evolution
Discovery of mappings

1. Indexing of records

Invenio will determine the Elasticsearch index for a given record, based on the record’s $schema key. For instance, given the following record:

{
    "$schema": "https://localhost/schemas/records/record-v1.0.0.json",
    "...": "..."
}

Invenio will send the above record to the records-record-v1.0.0 Elasticsearch index. Note, it’s possible to customize this behavior.
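The naming convention can be sketched as a small function that derives the index name from the $schema URL. This is an illustration of the convention only: the function name index_name_for and the /schemas/ prefix handling are assumptions based on the example above, not Invenio's actual implementation.

```python
from urllib.parse import urlparse


def index_name_for(schema_url, prefix="/schemas/"):
    """Derive an index name such as 'records-record-v1.0.0' from a $schema URL."""
    path = urlparse(schema_url).path
    if not path.startswith(prefix):
        raise ValueError(f"unexpected schema path: {path}")
    relative = path[len(prefix):]
    if relative.endswith(".json"):
        relative = relative[: -len(".json")]
    # Folder separators become dashes in the index name.
    return relative.replace("/", "-")


print(index_name_for("https://localhost/schemas/records/record-v1.0.0.json"))
# records-record-v1.0.0
```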

2. Data model evolution

Over time, data models are likely to evolve. In many cases, you can simply make backward compatible changes to the existing JSONSchema and Elasticsearch mappings. In cases where you change the data model in a backward incompatible way, you create a new JSONSchema and new mappings (e.g. record-v1.1.0.json):

|-- my_site
|   |-- records
|   |   |-- jsonschemas
|   |   |   |-- __init__.py
|   |   |   `-- records
|   |   |       |-- record-v1.0.0.json
|   |   |       `-- record-v1.1.0.json
|   |   |-- mappings
|   |   |   |-- __init__.py
|   |   |   `-- v7
|   |   |       |-- __init__.py
|   |   |       `-- records
|   |   |           |-- record-v1.0.0.json
|   |   |           `-- record-v1.1.0.json

This allows you to simultaneously store old and new records - i.e. you don’t have to take down your service for hours to migrate millions of records from one version to a new one.

Now of course, old records will be sent to the records-record-v1.0.0 index and new records will be sent to the records-record-v1.1.0 index. However, a special Elasticsearch index alias records is also created, that allows you to search over both old and new records, thus smoothly handling data model evolution.
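The relationship between mapping files, indices and the alias can be sketched as follows. The function indices_by_alias is a hypothetical helper that mirrors the folder convention described here (top-level folder name as alias, full path as index name); it does not talk to Elasticsearch.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def indices_by_alias(mapping_files):
    """Group index names under the alias given by their top-level folder."""
    aliases = defaultdict(list)
    for name in mapping_files:
        path = PurePosixPath(name)
        alias = path.parts[0]
        # Drop the .json suffix and join the path parts with dashes.
        index = "-".join(path.with_suffix("").parts)
        aliases[alias].append(index)
    return dict(aliases)


print(indices_by_alias([
    "records/record-v1.0.0.json",
    "records/record-v1.1.0.json",
]))
# {'records': ['records-record-v1.0.0', 'records-record-v1.1.0']}
```

Searching the records alias therefore covers both versions, while each record is still written to the index that matches its own $schema.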

3. Discovery of mappings

Invenio uses standard Python entry points to discover your data model package’s Elasticsearch mappings. Thus, in setup.py you’ll see an entry point group named invenio_search.mappings:

setup(
    # ...
    entry_points={
        'invenio_search.mappings': [
            'records = my_datamodel.mappings'
        ],
        # ...
    },
)

Note that the left-hand side of the entry point, records = my_datamodel.mappings, defines the folder name/index alias (i.e. records), and that the right-hand side defines the Python import path to the mappings package.

Note

A typical mistake is to forget to add a blank __init__.py file inside the mappings, v6 and v7 folders, in which case the entry points won’t be correctly discovered.

Define a Marshmallow schema

Marshmallow is a Python library that helps you write highly advanced serialization/deserialization/validation rules for your input/output data. You can think of Marshmallow schemas as akin to form validation.

Marshmallow use in Invenio is optional, but is usually very helpful when you go beyond purely structural data validation - e.g. validating one field given the value of another field.

In Invenio, the Marshmallow schemas are located in the marshmallow Python module. You may have multiple Marshmallow schemas depending on your serialization and deserialization needs.

|-- my_site
|   |-- records
|   |   |-- marshmallow
|   |   |   |-- __init__.py
|   |   |   `-- json.py

Below is a simplified example of a Marshmallow schema you could use in json.py (note that the scaffolded data model package includes a more complete example):

from invenio_records_rest.schemas import StrictKeysMixin
from marshmallow import fields

class RecordSchemaV1(StrictKeysMixin):
    metadata = fields.Raw()
    created = fields.Str()
    revision = fields.Integer()
    updated = fields.Str()
    links = fields.Dict()
    id = fields.Str()

In Invenio the Marshmallow schemas are often used together with serializers and loaders, so continue reading to see how the schema is used.

What’s the difference: JSONSchemas, Mappings and Marshmallow?

It may seem a bit confusing that Invenio deals with three types of schemas. There are, however, good reasons:

JSONSchema: Deals with the internal structural validation of records stored in the database (much like you define the table structure in a database).
Elasticsearch mappings: Deal with how records are indexed in Elasticsearch, which has a big impact on how your search results are ranked.
Marshmallow schema: Deals primarily with data validation and transformation for both serialization and deserialization (think of it as form validation).

Define serializers

Think of serializers as the definition of your output formats for records. The serializers are responsible for transforming the internal JSON for a record into some external representation (e.g. another JSON format or XML).
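Stripped of all Invenio machinery, a serializer is just a function from the internal JSON to some output format. The sketch below uses only the standard library; the function name to_xml and the element names are made up for illustration and do not correspond to any standard metadata format or to an Invenio API.

```python
import xml.etree.ElementTree as ET


def to_xml(record):
    """Serialize the internal JSON of a record into a tiny XML document."""
    root = ET.Element("record")
    ET.SubElement(root, "title").text = record["title"]
    return ET.tostring(root, encoding="unicode")


print(to_xml({"title": "My record"}))
# <record><title>My record</title></record>
```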

Serializers are defined in the serializers module:

|-- my_site
|   |-- records
|   |   |-- serializers
|   |   |   `-- __init__.py

By default, Invenio provides serializers that can help you serialize your internal record into common formats such as JSON-LD, Dublin Core, DataCite, MARCXML, Citation Style Language.

Example

In the scaffolded data model package, there’s an example of a simple serializer:

from invenio_records_rest.serializers.json import \
    JSONSerializer
from invenio_records_rest.serializers.response import \
    record_responsify, search_responsify

from ..marshmallow import RecordSchemaV1

#: JSON serializer definition.
json_v1 = JSONSerializer(RecordSchemaV1, replace_refs=True)

#: Serializer for individual records.
json_v1_response = record_responsify(json_v1, 'application/json')
#: Serializer for search results.
json_v1_search = search_responsify(json_v1, 'application/json')

First, we create an instance of the JSONSerializer and provide it with our previously created Marshmallow schema. The Marshmallow schema is used to transform the internal JSON before the JSONSerializer dumps the actual JSON output. This allows you, for example, to evolve your internal data model without affecting your REST API.

Next, we create two different response serializers: json_v1_response and json_v1_search. The former is responsible for producing an HTTP response for an individual record, while the latter is responsible for producing an HTTP response for a search result (i.e. multiple records).

The response serializer can not only output data to the HTTP response body, but can also add HTTP headers (e.g. Link headers).

You can see examples of the output from the two response serializers in the Quickstart section: Display a record and Search for records.

Define loaders

Think of loaders as the definition of your input formats for records. You only need loaders if you plan to allow creation of records via the REST API.

The loaders are responsible for transforming a request payload (external representation) into the internal JSON format.
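Conceptually, a loader parses the request payload, validates it, and returns the internal JSON. The following standard-library sketch shows that flow for a JSON payload; the function name load_json and the rule that title must be a non-empty string are assumptions chosen for the example, not behavior defined by Invenio.

```python
import json


def load_json(payload):
    """Parse and validate a request body, returning the internal JSON."""
    try:
        data = json.loads(payload)
    except ValueError as exc:
        raise ValueError("payload is not valid JSON") from exc
    if not isinstance(data, dict):
        raise ValueError("payload must be a JSON object")
    title = data.get("title")
    if not isinstance(title, str) or not title.strip():
        raise ValueError("title must be a non-empty string")
    # Keep only the fields the internal representation knows about.
    return {"title": title.strip()}


print(load_json(b'{"title": "  My record ", "unexpected": 1}'))
# {'title': 'My record'}
```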

Loaders are defined in the loaders module:

|-- my_site
|   |-- records
|   |   |-- loaders
|   |   |   `-- __init__.py

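Loaders are defined in much the same way as serializers, and similarly you can use the Marshmallow schemas. The example below uses MetadataSchemaV1, a schema for the record's metadata that is not part of the simplified Marshmallow example shown earlier: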

from invenio_records_rest.loaders.marshmallow import \
    marshmallow_loader
from ..marshmallow import MetadataSchemaV1

json_v1 = marshmallow_loader(MetadataSchemaV1)

Note that you are not required to use Marshmallow for deserialization, but it allows you to use advanced data validation rules on your REST API.

Define templates

To display records not only via your REST API but also through a search interface and landing pages, you need to provide templates that render your records.

You will need two different types of templates:

Search result template
Landing page template

The templates are stored in two different folders (static and templates):

|-- my_site
|   |-- records
|   |   |-- static
|   |   |   `-- templates
|   |   |       `-- my_datamodel
|   |   |           `-- results.html
|   |   |-- templates
|   |   |   `-- my_datamodel
|   |   |       `-- record.html

Search result template

The Invenio search interface is run by a JavaScript application, and thus the template is rendered client side in the user’s browser. The template uses data received by the REST API and thus your REST API must be able to deliver all information you would like to render in the template (your serializers are responsible for this).

The search results template is by default (it’s configurable) located in static/templates/my_datamodel/results.html and is using the Angular template syntax.

Landing page template

The landing page for a single record is rendered on the server-side using a Jinja template.

The landing page template is by default (it’s configurable) located in templates/my_datamodel/record.html and is using the Jinja template syntax.

Configure the UI

The last step, after having defined all the different schemas, serializers, loaders and templates, is to configure the REST API and landing pages for your records.

This is all done from the data model’s config.py:

|-- my_site
|   |-- records
|   |   |-- config.py

Note

Take care not to confuse my_site/records/config.py (the data model’s module configuration) with my_site/config.py (your application’s configuration).

To keep the application configuration file from growing too large, we usually keep the default configuration for a module in a config.py inside the module.

Landing page

Let’s start by configuring the landing page:

RECORDS_UI_ENDPOINTS = {
    'recid': {
        'pid_type': 'recid',
        'route': '/records/<pid_value>',
        'template': 'my_datamodel/record.html',
    },
}

Here is an explanation of the different keys:

pid_type: Defines the persistent identifier type which the resolver should use to look up records. Invenio provides an internal persistent identifier type called recid, which is an auto-incrementing integer.
route: URL endpoint under which to expose the landing pages.
template: Template to use when rendering the landing page.
recid: Unique name of the endpoint. If this is the primary landing page, it must be named the same as the value of pid_type (i.e. recid).

Configure the REST API

Configuring the REST API is done similarly to the landing pages, via the RECORDS_REST_ENDPOINTS configuration variable in config.py. The subsections below go through its keys one group at a time.

Persistent identifier type

First, you provide the persistent identifier type used by the resolver. You also need to configure a persistent identifier minter and fetcher. The scaffolded data model package simply uses the already provided recid minter and fetcher.

A minter is responsible for generating a new persistent identifier for your record, while a fetcher is responsible for extracting the persistent identifier from your search results:

RECORDS_REST_ENDPOINTS = {
    'recid': dict(
        pid_type='recid',
        pid_minter='recid',
        pid_fetcher='recid',
        # ...
    ),
}
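The division of labor between the two can be illustrated with a toy pair of functions. This is a deliberately simplified sketch: the names toy_minter and toy_fetcher, the in-memory counter and the id key are all hypothetical, and Invenio's real minters and fetchers work with persistent identifier objects and have different signatures.

```python
import itertools

# In-memory stand-in for an auto-incrementing identifier sequence.
_next_recid = itertools.count(1)


def toy_minter(record):
    """Generate a new identifier and store it in the record."""
    recid = str(next(_next_recid))
    record["id"] = recid  # hypothetical key, chosen for this sketch
    return recid


def toy_fetcher(record):
    """Extract the identifier from an already stored record."""
    return record["id"]


record = {"title": "My record"}
pid = toy_minter(record)
print(pid, toy_fetcher(record))
```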

Search

Next, you define the Elasticsearch index to use for searches. The index is defined as records because this is the index alias which was created for our mappings records/record-v1.0.0.json (see Naming JSONSchemas and mappings).

RECORDS_REST_ENDPOINTS = {
    'recid': dict(
        # ...
        search_index='records',
    ),
}

Serializers

Next, you define which serializers to use. Invenio is using HTTP Content Negotiation to choose your serializer. You have to specify the serializer for individual records in record_serializers and the serializers for search results in search_serializers:

RECORDS_REST_ENDPOINTS = {
    'recid': dict(
        # ...
        record_serializers={
            'application/json': (
                'my_datamodel.serializers:json_v1_response'),
        },
        search_serializers={
            'application/json': (
                'my_datamodel.serializers:json_v1_search'),
        },
    ),
}
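To show what content negotiation means in practice, here is a simplified sketch that picks a serializer from a mapping like record_serializers based on the Accept header. It is not how Invenio implements the selection: the function choose_serializer is hypothetical, it ignores media type parameters other than q, and the application/xml entry is invented so that there is something to choose between.

```python
def choose_serializer(accept_header, serializers):
    """Return the serializer for the most preferred acceptable media type."""
    ranked = []
    for position, item in enumerate(accept_header.split(",")):
        parts = [part.strip() for part in item.split(";")]
        media_type, quality = parts[0], 1.0
        for param in parts[1:]:
            if param.startswith("q="):
                try:
                    quality = float(param[2:])
                except ValueError:
                    quality = 0.0
        if media_type and quality > 0:
            # Sort by descending quality, then by order of appearance.
            ranked.append((-quality, position, media_type))
    for _, _, media_type in sorted(ranked):
        if media_type in serializers:
            return serializers[media_type]
        if media_type == "*/*" and serializers:
            # Wildcard: fall back to the first configured serializer.
            return next(iter(serializers.values()))
    return None


record_serializers = {
    "application/json": "my_datamodel.serializers:json_v1_response",
    # Hypothetical second format, only here to make the choice visible.
    "application/xml": "my_datamodel.serializers:xml_v1_response",
}

print(choose_serializer("application/xml;q=0.5, application/json", record_serializers))
# my_datamodel.serializers:json_v1_response
```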

Loaders

Next, you define the loaders to use. As with the serializers, the loaders are selected based on HTTP Content Negotiation.

RECORDS_REST_ENDPOINTS = {
    'recid': dict(
        # ...
        record_loaders={
            'application/json': (
                'my_datamodel.loaders:json_v1'),
        },
    ),
}

URL routes

Finally, you define the URL routes under which to expose your records:

RECORDS_REST_ENDPOINTS = {
    'recid': dict(
        # ...
        list_route='/records/',
        item_route='/records/<pid(recid):pid_value>',
    ),
}
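Putting the fragments from this section together gives a single endpoint definition along the following lines. It contains only the keys discussed above; the configuration in the scaffolded package may define further options that are not shown here.

```python
RECORDS_REST_ENDPOINTS = {
    'recid': dict(
        # Persistent identifier type, minter and fetcher.
        pid_type='recid',
        pid_minter='recid',
        pid_fetcher='recid',
        # Index alias used for searches.
        search_index='records',
        # Output formats, chosen via content negotiation.
        record_serializers={
            'application/json': (
                'my_datamodel.serializers:json_v1_response'),
        },
        search_serializers={
            'application/json': (
                'my_datamodel.serializers:json_v1_search'),
        },
        # Input formats, chosen via content negotiation.
        record_loaders={
            'application/json': (
                'my_datamodel.loaders:json_v1'),
        },
        # URL routes.
        list_route='/records/',
        item_route='/records/<pid(recid):pid_value>',
    ),
}
```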

Next steps

The above is a quick walkthrough of the different steps to build a data model. For more details on individual topics, we suggest further reading.
