.. This file is part of Invenio. Copyright (C) 2019 CERN. Invenio is free software; you can redistribute it and/or modify it under the terms of the MIT License; see LICENSE file for more details. .. _integrating-files: Integrating Files ================= With Invenio, you can attach files to records and use powerful REST APIs to upload and download files. You can also integrate previewers to nicely display the content of a file in a webpage and use APIs to deliver different types of images using the IIIF de facto standard. .. _handling-files-overview: Overview -------- The first step when setting up the files integration is to decide where the files should be stored. In Invenio, you can specify where files are stored by defining a ``Location``: a location is a representation of your storage system. It can be, for example, a local folder in your machine or a remote storage. It has a ``name`` and a ``URI`` to define the base path to the files. To access and manage files in a location, Invenio uses a ``Storage`` implementation: this defines how to access files in your location. In other words, it knows how files are stored, e.g. in a hierarchy of folders, and how to read and write them. You can define multiple locations and storage. This is useful, for example, when you are dealing with online and offline file systems. A single specific file is represented in Invenio by a ``FileInstance`` object: it defines its relative path to the Location where it is stored and also other useful properties such as its size or checksum. To sum up, you can think of the Location, Storage and FileInstance as the physical representation of your file system(s). Object storage ++++++++++++++ Invenio provides an abstraction of your physical file system with an object storage representation similar to Amazon S3. This allows great flexibility and easy to use APIs. You can compare this implementation to a traditional file system where files are contained in folders. In Invenio object storage, ``ObjectVersions`` are contained in ``Buckets``. Buckets are sets of objects, they are uniquely identified by an ID and can define constraints on file sizes or quotas. A bucket also defines the default Location to use when adding files, if none is provided. An ObjectVersion is a specific version (the ``version_id``) of a file at a given moment in time. It has a reference to a FileInstance and metadata such as the file name (the ``key``). The latest version of a file is marked as the ``head``. ObjectVersions allow to perform operations to your files without accessing the file system. Let's see a common example: a user wants to delete a previously uploaded file. With Invenio, this means creating a new ObjectVersion for this file, the new head, without reference to the FileInstance that it had before. The file on disk is not accessed or removed. This is also called delete marker or soft deletion. You can read more in the `Invenio-Files-REST `__ module documentation. .. _handling-files-integration-with-records: Integration with records ------------------------ Invenio integrates files and records by creating a reference between a Record and a Bucket. By default in Invenio, a record has a reference to one bucket and a bucket to one record. .. image:: ../_static/invenio-files-integration.png Invenio allows you to set up different scenarios and have, for example, multiple buckets per records and viceversa or buckets not attached to any record. To achieve any of these, you will have to define your own integration between records and files. The records-files integration provides a set of REST APIs to easily add, retrieve or delete files for a given record. .. _handling-files-using-rest-apis: Using REST APIs --------------- If you haven't already done so, make sure you've followed the :ref:`quickstart` so you have an Invenio instance to work on. Let's create a simple record: .. code-block:: console $ curl -k --header "Content-Type: application/json" \ --request POST \ --data '{"title":"Some title", "contributors": [{"name": "Doe, John"}]}' \ https://127.0.0.1:5000/api/records/?prettyprint=1 Response: .. code-block:: json { "created": "2019-11-25T15:02:24.379791+00:00", "id": "1", "links": { "files": "https://127.0.0.1:5000/api/records/1/files", "self": "https://127.0.0.1:5000/api/records/1" }, "metadata": { "contributors": [ { "name": "Doe, John" } ], "id": "1", "title": "Some title" }, "revision": 0, "updated": "2019-11-25T15:02:24.379798+00:00" } You can now upload a file to this record (if you are not using the default scripts to run the server, make sure your celery worker is running): .. code-block:: console $ echo "my file content" > my_file.txt $ curl -k -H "Content-Type: application/octet-stream" \ --request PUT \ --data-binary @my_file.txt \ https://127.0.0.1:5000/api/records/1/files/my_file.txt?prettyprint=1 Response: .. code-block:: json { "is_head": true, "updated": "2019-11-25T15:21:07.276520", "size": 16, "version_id": "577a96b9-94a1-4abf-8f6a-a5c168ee6faa", "key": "my_file.txt", "tags": {}, "links": { "self": "https://127.0.0.1:5000/api/records/1/files/my_file.txt", "version": "https://127.0.0.1:5000/api/records/1/files/my_file.txt?versionId=577a96b9-94a1-4abf-8f6a-a5c168ee6faa", "uploads": "https://127.0.0.1:5000/api/records/1/files/my_file.txt?uploads" }, "mimetype": "text/plain", "created": "2019-11-25T15:21:07.269683", "delete_marker": false, "checksum": "md5:1b7ea8126d278ecbfa9fcb9b0d7dc5af" } If you now fetch the record again, you can see that the uploaded files have been added to its metadata: .. code-block:: console $ curl -k --header "Content-Type: application/json" \ https://localhost:5000/api/records/1?prettyprint=1 Response: .. code-block:: json { "created": "2019-11-25T15:06:02.858325+00:00", "files": [ { "bucket": "7ddc1409-35a3-4a65-8324-89da4245f2f9", "checksum": "md5:1b7ea8126d278ecbfa9fcb9b0d7dc5af", "file_id": "6f413750-82ca-45bb-aa5a-0f009b651843", "key": "my_file.txt", "size": 16, "version_id": "577a96b9-94a1-4abf-8f6a-a5c168ee6faa" } ], "id": "1", "links": { "files": "https://localhost:5000/api/records/1/files", "self": "https://localhost:5000/api/records/1" }, "metadata": { "contributors": [ { "name": "Doe, John" } ], "id": "1", "title": "Some title" }, "revision": 2, "updated": "2019-11-25T15:21:07.453874+00:00" } You can download the file by requesting it with its filename: .. code-block:: console $ curl -k --header "Content-Type: application/json" \ https://localhost:5000/api/records/1/files/my_file.txt You can also delete the uploaded file: .. code-block:: console $ curl -k --header "Content-Type: application/json" \ --request DELETE \ https://localhost:5000/api/records/1/files/my_file.txt Integration details +++++++++++++++++++ When creating a new record, Invenio automatically creates and assigns a new Bucket to the newly created record. Then, when a new file is uploaded to the record, Invenio will: 1. fetch the Bucket assigned to the Record 2. store the file in the bucket's default Location using the configured Storage 3. create a new FileInstance with size, checksum and the URI path pointing to the file 4. create a new ObjectVersion with a reference to the FileInstance and the Bucket to which it belongs 5. update the record's metadata to add the metadata of the new file You can learn more on how record and files work together as well as the available APIs by reading the documentation of `Invenio-Records-Files `__ and `Invenio-Files-REST `__. .. _handling-files-setup-your-storage: Setup your storage ------------------ With the quickstart application, a default Location is set up in the same directory of your virtual environment. You can create your own locations by using the CLI. The only constraint is that you will always have to define at least one ``default`` Location. For example, you can define a new default location named ``shared`` in the path ``/mnt/shared``: .. code-block:: console $ pipenv run invenio files location shared /mnt/shared --default From now on, any new Bucket, besides the existing ones will use this location and therefore files will be stored in ``/mnt/shared``. Invenio provides a default storage implementation based on `PyFilesystem `__ and it will store files in the path ``//data``. The middle path ```` can be adjusted via configuration variables. For example, the previously uploaded file ``my_file.txt`` will be saved on disk in ``/mnt/shared/4j/0f/k7ss-h8k1-0k2h/data``. .. note:: Every file in Invenio is stored on disk with the file name ``data``. This is to avoid any possible issue with user input and potentially unsupported special characters. The original file name is stored in the ObjectVersion metadata and this internal implementation is never exposed to the user. Custom storage ++++++++++++++ The default storage implementation in Invenio uses `PyFilesystem `__ to access the file system. If this does not fulfill your requirements, you can implement your own. The :py:class:`invenio_files_rest.storage.FileStorage` is the base class interface that defines the operations used when accessing files. You can create your own factory that will instantiate and return your storage implementation. .. code-block:: python def my_storage_factory(fileinstance=None, default_location=None, default_storage_class=None, filestorage_class=MyFileStorage, fileurl=None, size=None, modified=None, clean_dir=True): fileurl = fileinstance.uri return filestorage_class( fileurl, size=size, modified=modified, clean_dir=clean_dir) Then, you can configure Invenio to use this new storage by setting the related configuration variable in your ``config.py``: .. code-block:: python FILES_REST_STORAGE_FACTORY = "my_storage_factory" If you are looking for an integration with a S3 object storage, you can read more about it on the `Invenio-S3 `__ documentation. .. _handling-files-permissions: Permissions ----------- Files permissions relies on `Invenio-Access `__ to allow configured users or roles to perform actions. These concepts are also described in the :ref:`managing-access` section. The integration with records does not set any particular permission on files: it is your responsibility to decide how to give access to files based on your record. The first step is to implement your own permission factory. As an example, let's implement a factory that allows access to files only to the user that is owner of the record (the record should have a field ``owner``). .. code-block:: python from flask_principal import UserNeed from invenio_access import Permission, superuser_access from invenio_files_rest.models import Bucket, MultipartObject, ObjectVersion from invenio_records import Record from invenio_records_files.models import RecordsBuckets def my_permission_factory(obj, action): """Given an action, return the permission for the given object. :param obj: An instance of :class:`invenio_files_rest.models.Bucket` or :class:`invenio_files_rest.models.ObjectVersion` or :class:`invenio_files_rest.models.MultipartObject` or ``None`` if the action is global. :param action: The required action. :raises RuntimeError: If the object is unknown. :returns: A :class:`invenio_access.permissions.Permission` instance. """ # apply the same permission to any `action` # retrieve the bucket from the requested `obj` bucket_id = None if isinstance(obj, Bucket): bucket_id = str(obj.id) elif isinstance(obj, ObjectVersion) or isinstance(obj, MultipartObject): bucket_id = str(obj.bucket_id) if bucket_id is not None: # retrieve the record with this bucket attached # we assume that there is only one record_bucket = RecordsBuckets.query.filter_by(bucket_id=bucket_id).one_or_none() if record_bucket is not None: # retrieve the owner field record = Record.get_record(record_bucket.record_id) owner = record.get("owner") if owner: return Permission(UserNeed(record["owner"])) # allow only admins return Permission(superuser_access) Then, configure Invenio to use this function when validating permissions by setting the related configuration variable in ``config.py``: .. code-block:: python FILES_REST_PERMISSION_FACTORY = "my_permission_factory" Response codes ++++++++++++++ If the authorization for an action fails, Invenio will normally returns a ``403`` response code for authenticated users, ``401`` otherwise. For security reasons, when trying to retrieve an unauthorized file, it will return a ``404`` instead to hide the existence or non-existence of the file. .. _handling-files-upload-large-files: Large files upload ------------------ When trying to upload a large file, it might happen that your HTTP request aborts and returns a response code :code:`413 (Request Entity Too Large)`. The maximum upload size is limited by the default configuration of Flask and most probably your web server. You can adjust these configurations according to your needs. For Flask, set the :code:`MAX_CONTENT_LENGTH` configuration variable. Be aware that if the request does not specify a :code:`CONTENT_LENGTH`, no data will be read. .. code-block:: console $ app.config['MAX_CONTENT_LENGTH'] = 25 * 1024 * 1024 # bytes Here an example to tune the configuration of ``Nginx``. In case you use another web server, please consult its documentation. .. code-block:: console http { ... client_max_body_size 25M; } .. _handling-files-integrity-checks: Files integrity checks ---------------------- To ensure that files in your file system are not damaged, it is recommended to set up files integrity checks. This consists in a periodical tasks that scan your files and re-compute each checksum by comparing it with the one calculated when uploaded. In case of mismatch, it will throw an exception. Configure the task in your ``config.py``: .. code-block:: python CELERY_BEAT_SCHEDULE = { 'file-checks': { 'task': 'invenio_files_rest.tasks.schedule_checksum_verification', 'schedule': timedelta(hours=1), } } Make sure that `celery beat `_ is running: .. code-block:: console $ celery -A invenio_app.celery beat When the task `schedule_checksum_verification `_ runs, it will retrieve a number of files to check based on a set of constraints in order to throttle the execution rate of the checks. For each file, it will then spawn the task `verify_checksum `_ to calculate the checksum. Given that this task will constantly check files, it is recommended to schedule these tasks on a separate low priority queue. Create a new queue called ``low`` in your ``config.py``: .. code-block:: python CELERY_TASK_ROUTES = { 'invenio_files_rest.tasks.verify_checksum': {'queue': 'low'}, } Then, spawn only one worker that will consume tasks sent to the ``low`` queue: .. code-block:: console $ celery -A invenio_app.celery worker -l info -Q low .. _handling-files-previewing: Previewing files ---------------- Invenio has support for previewing many of the most popular file formats including PDF, ZIP, Markdown, images and Jupyter Notebooks. Given an ObjectVersion with a filename (the ``key`` field), Invenio will iterate through the available previewers and use the first matching the file extension contained in the filename. The ordered list of previewers can be configured via the configuration variable `PREVIEWER_PREFERENCE `_. For example, given a ``thesis.pdf`` to preview and the following configuration: .. code-block:: python PREVIEWER_PREFERENCE = [ "simple_image", # previews .jpg and .png "a_pdf_previewer", # previews .pdf "another_pdf_previewer", # previews .pdf ] only the ``a_pdf_previewer`` will be run as previewer. ``another_pdf_previewer`` will be never executed. To preview a file in your website, you can use the available endpoint ``/records//preview/`` and the view ``invenio_previewer.views:preview``. In your ``config.py`` add: .. code:: python RECORDS_UI_ENDPOINTS=dict( recid_previewer=dict( pid_type='recid', route='/records//preview/', view_imp='invenio_previewer.views:preview', record_class='invenio_records_files.api:Record', ), ) You see the list of available previewer and learn how to create your own previewer by reading the documentation of `Invenio-Previewer `__. .. _handling-files-iiif: Handling images using IIIF -------------------------- Invenio implements the `IIIF Image APIs `_, a de facto standard for delivering images on the web. It allows you to generate thumbnails, resize, zoom and preview images. For example, you can resize on the fly images uploaded by the user to a dimension that best suites your website layout. This is very useful, for example, when displaying thumbnails in the list of search results. Let's say that you want to resize the large image ``large.png`` uploaded by the user to ``640x480`` pixels. You can use the available REST APIs and retrieve the image as the following: .. code-block:: console # IIIF Image specification: /region/size/rotation/quality.format /api/iiif/::large.png/full/640,480/0/default.png Let's say that you now want to achieve the same when previewing the image in your website, and not via REST APIs. You can take advantage of the files preview and integrate IIIF with it. Add the IIIF previewer ``iiif_image`` in your ``config.py``: .. code-block:: python PREVIEWER_PREFERENCE = [ 'iiif_image', 'pdfjs', 'zip', ] and configure if to resize to your needs. In ``config.py``: .. code-block:: python IIIF_PREVIEWER_PARAMS = { 'size': '640,480' } To learn more about the IIIF integration, see the `Invenio-IIIF `__ documentation. .. _handling-files-security: Security -------- When serving files, you will have to take into account any security implication. Here you can find some recommendations to mitigate possible vulnerabilities, such as Cross-Site Scripting (XSS): 1. If possible, serve user uploaded files from a separate domain (not a subdomain). 2. By default, Invenio-Files-REST sets some response headers to prevent the browser from rendering and executing HTML files. For files that you consider safe and you need to have rendered, you can configure the `MIMETYPE_WHITELIST `_. See `send_stream `_ for more information. 3. Prefer file download instead of allowing the browser to preview any file, by adding the :code:`?download` URL query argument. Next steps ---------- You can have detailed information by reading the documentation of each module: - `Invenio-Files-REST `__ - `Invenio-Records-Files `__ - `Invenio-Previewer `__ - `Invenio-IIIF `__