S3

S3 is widely used for data storage, and magnus can use S3 as catalog storage.

Additional dependencies

Magnus extensions needs AWS capabilities, provided by boto3, to use S3. You can install them via:

pip install magnus_extensions[aws]

or via:

poetry add magnus_extensions[aws]

Configuration

The full configuration to use S3 as a data catalog:

catalog:
  type: s3
  config:
    aws_profile: str # defaults to ''
    use_credentials: bool # defaults to False
    region: str # defaults to eu-west-1
    aws_credentials_file: str # defaults to str(Path.home() / '.aws' / 'credentials')
    aws_access_key_name: str # defaults to 'AWS_ACCESS_KEY_ID'
    aws_secret_access_key_name: str # defaults to 'AWS_SECRET_ACCESS_KEY'
    aws_session_key_name: str # defaults to 'AWS_SESSION_TOKEN'
    role_arn: str # defaults to ''
    session_duration_in_seconds: int # defaults to 900
    compute_data_folder: str # defaults to data/
    s3_bucket: str # Should be PROVIDED
    prefix: str # defaults to str
  • compute_data_folder:

The data folder that is used for your work.

Logically, cataloging works as follows:

  • get files from the catalog before the execution to a specific compute data folder
  • execute the command
  • put the files from the compute data folder to the catalog.
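The three steps above can be sketched in a few lines of Python. This is only an illustration of the flow, not how magnus implements it: a local folder stands in for the S3 catalog, and `copy_files` and `run_step` are hypothetical helper names.

```python
import shutil
import subprocess
from pathlib import Path


def copy_files(source: Path, destination: Path, patterns: list[str]) -> list[str]:
    """Copy files matching the glob patterns from source into destination."""
    destination.mkdir(parents=True, exist_ok=True)
    copied = []
    for pattern in patterns:
        for path in sorted(source.glob(pattern)):
            if path.is_file():
                shutil.copy2(path, destination / path.name)
                copied.append(path.name)
    return copied


def run_step(catalog: Path, compute_data_folder: Path, command: list[str],
             get: list[str], put: list[str]) -> list[str]:
    """Hypothetical helper: get from the catalog, execute, put back."""
    # 1. get: bring the requested files into the compute data folder
    copy_files(catalog, compute_data_folder, get)
    # 2. execute the command inside the compute data folder
    subprocess.run(command, check=True, cwd=compute_data_folder)
    # 3. put: store the requested outputs in the catalog
    return copy_files(compute_data_folder, catalog, put)
```

The function returns the names of the files that were put into the catalog, which makes the outcome of a step easy to inspect.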

You can override the globally defined compute data folder for individual steps by providing it in the step configuration.

For example:

catalog:
  ...

dag:
  steps:
    step name:
      ...
      catalog:
        compute_data_folder: # optional and only applies to this step
        get:
          - list
        put:
          - list

    ...
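
One way to picture the override is as a lookup that prefers the step-level value and falls back to the global one. The sketch below assumes the configuration has already been loaded into plain dictionaries; `effective_compute_data_folder` is a hypothetical name, and the fallback of `data/` is the documented default.

```python
def effective_compute_data_folder(global_config: dict, step_config: dict) -> str:
    """Hypothetical helper: pick the compute data folder for one step."""
    # The step-level catalog section is optional and may be empty.
    step_catalog = step_config.get("catalog") or {}
    override = step_catalog.get("compute_data_folder")
    if override:
        return override
    # Fall back to the global setting, then to the documented default.
    return global_config.get("compute_data_folder", "data/")
```
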
  • s3_bucket:

The S3 bucket to use as a catalog.

  • prefix:

The prefix to the path where the cataloging is done.

For example, if the prefix is catalog, the catalog for each run is stored at:

<s3_bucket>/catalog/<run_id>/.
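
The layout above can be expressed as a small helper. This is a sketch of the path convention only; `run_catalog_location` is a hypothetical name, and stripping stray slashes is an assumption made here for tidiness.

```python
def run_catalog_location(s3_bucket: str, prefix: str, run_id: str) -> str:
    """Hypothetical helper: build <s3_bucket>/<prefix>/<run_id>/."""
    # Strip surrounding slashes so the parts join with exactly one separator.
    bucket = s3_bucket.strip("/")
    clean_prefix = prefix.strip("/")
    run = run_id.strip("/")
    return f"{bucket}/{clean_prefix}/{run}/"


print(run_catalog_location("my-bucket", "catalog", "run-001"))
# my-bucket/catalog/run-001/
```
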

  • aws_profile:

    Defaults to '', in which case the default profile is used.

  • use_credentials:

    Defaults to False. It is always safer to use RBAC instead of credentials.

  • region:

    Defaults to eu-west-1. The AWS region in which the boto3 session is instantiated.

  • aws_credentials_file:

    Defaults to str(Path.home() / '.aws' / 'credentials'). The file where AWS credentials are typically stored. This file is used both when use_credentials is enabled and internally by boto3 when looking up profiles.

  • aws_access_key_name:

    Defaults to 'AWS_ACCESS_KEY_ID'. The name of the environment variable that holds the AWS access key, if you are using credentials.

  • aws_secret_access_key_name:

    Defaults to 'AWS_SECRET_ACCESS_KEY'. The name of the environment variable that holds the AWS secret access key, if you are using credentials.

  • aws_session_key_name:

    Defaults to 'AWS_SESSION_TOKEN'. The name of the environment variable that holds the AWS session token, if you are using credentials.

  • role_arn:

    Defaults to ''. The role to assume if you are using sessions.

  • session_duration_in_seconds:

    Defaults to 900. The duration of the AWS session, in seconds.
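
To illustrate how the three `*_name` settings relate to environment variables, here is a minimal sketch that reads the configured variable names and collects the values under the keyword names accepted by `boto3.Session` (`aws_access_key_id`, `aws_secret_access_key`, `aws_session_token`, `region_name`). It is one plausible reading of the settings above, not the extension's actual code; `credentials_from_environment` is a hypothetical name, and boto3 itself is deliberately not imported.

```python
import os


def credentials_from_environment(config: dict, environ=os.environ) -> dict:
    """Hypothetical helper: map configured env var names to session kwargs."""
    # Config keys hold the *names* of environment variables, not the secrets.
    names = {
        "aws_access_key_id": config.get("aws_access_key_name", "AWS_ACCESS_KEY_ID"),
        "aws_secret_access_key": config.get(
            "aws_secret_access_key_name", "AWS_SECRET_ACCESS_KEY"
        ),
        "aws_session_token": config.get("aws_session_key_name", "AWS_SESSION_TOKEN"),
    }
    # Only include values that are actually set in the environment.
    kwargs = {key: environ[name] for key, name in names.items() if name in environ}
    kwargs["region_name"] = config.get("region", "eu-west-1")
    return kwargs
```

With boto3 installed, the returned dictionary could be unpacked as `boto3.Session(**kwargs)`. Keeping only variable names in the configuration means the secrets themselves never appear in the YAML file.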