Integrating Files

With Invenio, you can attach files to records and use powerful REST APIs to upload and download files. You can also integrate previewers to nicely display the content of a file in a webpage and use APIs to deliver different types of images using the IIIF de facto standard.

Overview

The first step when setting up the files integration is to decide where the files should be stored. In Invenio, you can specify where files are stored by defining a Location: a location is a representation of your storage system. It can be, for example, a local folder in your machine or a remote storage. It has a name and a URI to define the base path to the files.

To access and manage files in a location, Invenio uses a Storage implementation: this defines how to access files in your location. In other words, it knows how files are stored, e.g. in a hierarchy of folders, and how to read and write them.

You can define multiple locations and storage. This is useful, for example, when you are dealing with online and offline file systems.

A single specific file is represented in Invenio by a FileInstance object: it defines its relative path to the Location where it is stored and also other useful properties such as its size or checksum.

To sum up, you can think of the Location, Storage and FileInstance as the physical representation of your file system(s).

Object storage

Invenio provides an abstraction of your physical file system with an object storage representation similar to Amazon S3. This allows great flexibility and easy to use APIs.

You can compare this implementation to a traditional file system where files are contained in folders. In Invenio object storage, ObjectVersions are contained in Buckets.

Buckets are sets of objects, they are uniquely identified by an ID and can define constraints on file sizes or quotas. A bucket also defines the default Location to use when adding files, if none is provided.

An ObjectVersion is a specific version (the version_id) of a file at a given moment in time. It has a reference to a FileInstance and metadata such as the file name (the key). The latest version of a file is marked as the head.

ObjectVersions allow to perform operations to your files without accessing the file system. Let’s see a common example: a user wants to delete a previously uploaded file. With Invenio, this means creating a new ObjectVersion for this file, the new head, without reference to the FileInstance that it had before. The file on disk is not accessed or removed. This is also called delete marker or soft deletion.

You can read more in the Invenio-Files-REST module documentation.

Integration with records

Invenio integrates files and records by creating a reference between a Record and a Bucket. By default in Invenio, a record has a reference to one bucket and a bucket to one record.

../../_images/invenio-files-integration.png

Invenio allows you to set up different scenarios and have, for example, multiple buckets per records and viceversa or buckets not attached to any record. To achieve any of these, you will have to define your own integration between records and files.

The records-files integration provides a set of REST APIs to easily add, retrieve or delete files for a given record.

Using REST APIs

If you haven’t already done so, make sure you’ve followed the Quickstart so you have an Invenio instance to work on.

Let’s create a simple record:

$ curl -k --header "Content-Type: application/json" \
    --request POST \
    --data '{"title":"Some title", "contributors": [{"name": "Doe, John"}]}' \
    https://127.0.0.1:5000/api/records/?prettyprint=1

Response:

{
  "created": "2019-11-25T15:02:24.379791+00:00",
  "id": "1",
  "links": {
    "files": "https://127.0.0.1:5000/api/records/1/files",
    "self": "https://127.0.0.1:5000/api/records/1"
  },
  "metadata": {
    "contributors": [
      {
        "name": "Doe, John"
      }
    ],
    "id": "1",
    "title": "Some title"
  },
  "revision": 0,
  "updated": "2019-11-25T15:02:24.379798+00:00"
}

You can now upload a file to this record (if you are not using the default scripts to run the server, make sure your celery worker is running):

$ echo "my file content" > my_file.txt

$ curl -k -H "Content-Type: application/octet-stream" \
    --request PUT \
    --data-binary @my_file.txt \
    https://127.0.0.1:5000/api/records/1/files/my_file.txt?prettyprint=1

Response:

{
  "is_head": true,
  "updated": "2019-11-25T15:21:07.276520",
  "size": 16,
  "version_id": "577a96b9-94a1-4abf-8f6a-a5c168ee6faa",
  "key": "my_file.txt",
  "tags": {},
  "links": {
    "self": "https://127.0.0.1:5000/api/records/1/files/my_file.txt",
    "version": "https://127.0.0.1:5000/api/records/1/files/my_file.txt?versionId=577a96b9-94a1-4abf-8f6a-a5c168ee6faa",
    "uploads": "https://127.0.0.1:5000/api/records/1/files/my_file.txt?uploads"
  },
  "mimetype": "text/plain",
  "created": "2019-11-25T15:21:07.269683",
  "delete_marker": false,
  "checksum": "md5:1b7ea8126d278ecbfa9fcb9b0d7dc5af"
}

If you now fetch the record again, you can see that the uploaded files have been added to its metadata:

$ curl -k --header "Content-Type: application/json" \
    https://localhost:5000/api/records/1?prettyprint=1

Response:

{
  "created": "2019-11-25T15:06:02.858325+00:00",
  "files": [
    {
      "bucket": "7ddc1409-35a3-4a65-8324-89da4245f2f9",
      "checksum": "md5:1b7ea8126d278ecbfa9fcb9b0d7dc5af",
      "file_id": "6f413750-82ca-45bb-aa5a-0f009b651843",
      "key": "my_file.txt",
      "size": 16,
      "version_id": "577a96b9-94a1-4abf-8f6a-a5c168ee6faa"
    }
  ],
  "id": "1",
  "links": {
    "files": "https://localhost:5000/api/records/1/files",
    "self": "https://localhost:5000/api/records/1"
  },
  "metadata": {
    "contributors": [
      {
        "name": "Doe, John"
      }
    ],
    "id": "1",
    "title": "Some title"
  },
  "revision": 2,
  "updated": "2019-11-25T15:21:07.453874+00:00"
}

You can download the file by requesting it with its filename:

$ curl -k --header "Content-Type: application/json" \
    https://localhost:5000/api/records/1/files/my_file.txt

You can also delete the uploaded file:

$ curl -k --header "Content-Type: application/json" \
    --request DELETE \
    https://localhost:5000/api/records/1/files/my_file.txt

Integration details

When creating a new record, Invenio automatically creates and assigns a new Bucket to the newly created record. Then, when a new file is uploaded to the record, Invenio will:

  1. fetch the Bucket assigned to the Record

  2. store the file in the bucket’s default Location using the configured Storage

  3. create a new FileInstance with size, checksum and the URI path pointing to the file

  4. create a new ObjectVersion with a reference to the FileInstance and the Bucket to which it belongs

  5. update the record’s metadata to add the metadata of the new file

You can learn more on how record and files work together as well as the available APIs by reading the documentation of Invenio-Records-Files and Invenio-Files-REST.

Setup your storage

With the quickstart application, a default Location is set up in the same directory of your virtual environment.

You can create your own locations by using the CLI. The only constraint is that you will always have to define at least one default Location.

For example, you can define a new default location named shared in the path /mnt/shared:

$ pipenv run invenio files location shared /mnt/shared --default

From now on, any new Bucket, besides the existing ones will use this location and therefore files will be stored in /mnt/shared.

Invenio provides a default storage implementation based on PyFilesystem and it will store files in the path <location_uri>/<file instance uuid>/data. The middle path <file instance uuid> can be adjusted via configuration variables.

For example, the previously uploaded file my_file.txt will be saved on disk in /mnt/shared/4j/0f/k7ss-h8k1-0k2h/data.

Note

Every file in Invenio is stored on disk with the file name data. This is to avoid any possible issue with user input and potentially unsupported special characters. The original file name is stored in the ObjectVersion metadata and this internal implementation is never exposed to the user.

Custom storage

The default storage implementation in Invenio uses PyFilesystem to access the file system. If this does not fulfill your requirements, you can implement your own.

The invenio_files_rest.storage.FileStorage is the base class interface that defines the operations used when accessing files. You can create your own factory that will instantiate and return your storage implementation.

def my_storage_factory(fileinstance=None, default_location=None,
                       default_storage_class=None,
                       filestorage_class=MyFileStorage, fileurl=None,
                       size=None, modified=None, clean_dir=True):
    fileurl = fileinstance.uri
    return filestorage_class(
        fileurl, size=size, modified=modified, clean_dir=clean_dir)

Then, you can configure Invenio to use this new storage by setting the related configuration variable in your config.py:

FILES_REST_STORAGE_FACTORY = "my_storage_factory"

If you are looking for an integration with a S3 object storage, you can read more about it on the Invenio-S3 documentation.

Permissions

Files permissions relies on Invenio-Access to allow configured users or roles to perform actions. These concepts are also described in the Managing access to records section.

The integration with records does not set any particular permission on files: it is your responsibility to decide how to give access to files based on your record.

The first step is to implement your own permission factory. As an example, let’s implement a factory that allows access to files only to the user that is owner of the record (the record should have a field owner).

from flask_principal import UserNeed
from invenio_access import Permission, superuser_access
from invenio_files_rest.models import Bucket, MultipartObject, ObjectVersion
from invenio_records import Record
from invenio_records_files.models import RecordsBuckets

def my_permission_factory(obj, action):
    """Given an action, return the permission for the given object.

    :param obj: An instance of :class:`invenio_files_rest.models.Bucket` or
        :class:`invenio_files_rest.models.ObjectVersion` or
        :class:`invenio_files_rest.models.MultipartObject` or ``None`` if
        the action is global.
    :param action: The required action.
    :raises RuntimeError: If the object is unknown.
    :returns: A :class:`invenio_access.permissions.Permission` instance.
    """
    # apply the same permission to any `action`
    # retrieve the bucket from the requested `obj`
    bucket_id = None
    if isinstance(obj, Bucket):
        bucket_id = str(obj.id)
    elif isinstance(obj, ObjectVersion) or isinstance(obj, MultipartObject):
        bucket_id = str(obj.bucket_id)

    if bucket_id is not None:
        # retrieve the record with this bucket attached
        # we assume that there is only one
        record_bucket = RecordsBuckets.query.filter_by(bucket_id=bucket_id).one_or_none()
        if record_bucket is not None:
            # retrieve the owner field
            record = Record.get_record(record_bucket.record_id)
            owner = record.get("owner")
            if owner:
                return Permission(UserNeed(record["owner"]))
    # allow only admins
    return Permission(superuser_access)

Then, configure Invenio to use this function when validating permissions by setting the related configuration variable in config.py:

FILES_REST_PERMISSION_FACTORY = "my_permission_factory"

Response codes

If the authorization for an action fails, Invenio will normally returns a 403 response code for authenticated users, 401 otherwise. For security reasons, when trying to retrieve an unauthorized file, it will return a 404 instead to hide the existence or non-existence of the file.

Large files upload

When trying to upload a large file, it might happen that your HTTP request aborts and returns a response code 413 (Request Entity Too Large). The maximum upload size is limited by the default configuration of Flask and most probably your web server.

You can adjust these configurations according to your needs.

For Flask, set the MAX_CONTENT_LENGTH configuration variable. Be aware that if the request does not specify a CONTENT_LENGTH, no data will be read.

$ app.config['MAX_CONTENT_LENGTH'] = 25 * 1024 * 1024  # bytes

Here an example to tune the configuration of Nginx. In case you use another web server, please consult its documentation.

http {
 ...
 client_max_body_size 25M;
}

Files integrity checks

To ensure that files in your file system are not damaged, it is recommended to set up files integrity checks. This consists in a periodical tasks that scan your files and re-compute each checksum by comparing it with the one calculated when uploaded. In case of mismatch, it will throw an exception.

Configure the task in your config.py:

CELERY_BEAT_SCHEDULE = {
    'file-checks': {
       'task': 'invenio_files_rest.tasks.schedule_checksum_verification',
       'schedule': timedelta(hours=1),
    }
}

Make sure that celery beat is running:

$ celery -A invenio_app.celery beat

When the task schedule_checksum_verification runs, it will retrieve a number of files to check based on a set of constraints in order to throttle the execution rate of the checks. For each file, it will then spawn the task verify_checksum to calculate the checksum.

Given that this task will constantly check files, it is recommended to schedule these tasks on a separate low priority queue.

Create a new queue called low in your config.py:

CELERY_TASK_ROUTES = {
    'invenio_files_rest.tasks.verify_checksum': {'queue': 'low'},
}

Then, spawn only one worker that will consume tasks sent to the low queue:

$ celery -A invenio_app.celery worker -l info -Q low

Previewing files

Invenio has support for previewing many of the most popular file formats including PDF, ZIP, Markdown, images and Jupyter Notebooks.

Given an ObjectVersion with a filename (the key field), Invenio will iterate through the available previewers and use the first matching the file extension contained in the filename. The ordered list of previewers can be configured via the configuration variable PREVIEWER_PREFERENCE.

For example, given a thesis.pdf to preview and the following configuration:

PREVIEWER_PREFERENCE = [
    "simple_image",  # previews .jpg and .png
    "a_pdf_previewer",  # previews .pdf
    "another_pdf_previewer",  # previews .pdf
]

only the a_pdf_previewer will be run as previewer. another_pdf_previewer will be never executed.

To preview a file in your website, you can use the available endpoint /records/<pid_value>/preview/<filename> and the view invenio_previewer.views:preview. In your config.py add:

RECORDS_UI_ENDPOINTS=dict(
    recid_previewer=dict(
        pid_type='recid',
        route='/records/<pid_value>/preview/<filename>',
        view_imp='invenio_previewer.views:preview',
        record_class='invenio_records_files.api:Record',
    ),
)

You see the list of available previewer and learn how to create your own previewer by reading the documentation of Invenio-Previewer.

Handling images using IIIF

Invenio implements the IIIF Image APIs, a de facto standard for delivering images on the web. It allows you to generate thumbnails, resize, zoom and preview images.

For example, you can resize on the fly images uploaded by the user to a dimension that best suites your website layout. This is very useful, for example, when displaying thumbnails in the list of search results.

Let’s say that you want to resize the large image large.png uploaded by the user to 640x480 pixels. You can use the available REST APIs and retrieve the image as the following:

# IIIF Image specification:                 /region/size/rotation/quality.format
/api/iiif/<bucket_id>:<version_id>:large.png/full/640,480/0/default.png

Let’s say that you now want to achieve the same when previewing the image in your website, and not via REST APIs. You can take advantage of the files preview and integrate IIIF with it.

Add the IIIF previewer iiif_image in your config.py:

PREVIEWER_PREFERENCE = [
    'iiif_image',
    'pdfjs',
    'zip',
]

and configure if to resize to your needs. In config.py:

IIIF_PREVIEWER_PARAMS = {
    'size': '640,480'
}

To learn more about the IIIF integration, see the Invenio-IIIF documentation.

Security

When serving files, you will have to take into account any security implication. Here you can find some recommendations to mitigate possible vulnerabilities, such as Cross-Site Scripting (XSS):

  1. If possible, serve user uploaded files from a separate domain (not a subdomain).

  2. By default, Invenio-Files-REST sets some response headers to prevent the browser from rendering and executing HTML files. For files that you consider safe and you need to have rendered, you can configure the MIMETYPE_WHITELIST. See send_stream for more information.

  3. Prefer file download instead of allowing the browser to preview any file, by adding the ?download URL query argument.

Next steps

You can have detailed information by reading the documentation of each module: