S3¶
S3 is widely used for data storage, and magnus can use S3 as its catalog storage.
Additional dependencies¶
Magnus extensions need AWS capabilities, provided by boto3, to use S3. You can install them via:
pip install magnus_extensions[aws]
or via:
poetry add magnus_extensions[aws]
Configuration¶
The full configuration to use S3 as the data catalog:
catalog:
  type: s3
  config:
    aws_profile: str # defaults to ''
    use_credentials: bool # defaults to False
    region: str # defaults to eu-west-1
    aws_credentials_file: str # defaults to str(Path.home() / '.aws' / 'credentials')
    aws_access_key_name: str # defaults to 'AWS_ACCESS_KEY_ID'
    aws_secret_access_key_name: str # defaults to 'AWS_SECRET_ACCESS_KEY'
    aws_session_key_name: str # defaults to 'AWS_SESSION_TOKEN'
    role_arn: str # defaults to ''
    session_duration_in_seconds: int # defaults to 900
    compute_data_folder: str # defaults to data/
    s3_bucket: str # Should be PROVIDED
    prefix: str # defaults to str
compute_data_folder:¶
The data folder that is used for your work.
Logically, cataloging works as follows:
- get files from the catalog before the execution to a specific compute data folder
- execute the command
- put the files from the compute data folder to the catalog.
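The three steps above can be sketched with local directories standing in for the catalog. This is only an illustration of the get, execute, put order, not the way magnus or its S3 catalog is implemented; `run_with_catalog` and its parameters are hypothetical names.

```python
import shutil
from pathlib import Path
from typing import Callable


def run_with_catalog(
    catalog_dir: Path,
    compute_data_folder: Path,
    command: Callable[[Path], None],
) -> None:
    """Get files from the catalog, execute the command, put files back.

    A local directory plays the role of the catalog here (an assumption
    made so the sketch runs without AWS access).
    """
    catalog_dir.mkdir(parents=True, exist_ok=True)
    compute_data_folder.mkdir(parents=True, exist_ok=True)

    # 1. get: copy catalogued files into the compute data folder
    for item in catalog_dir.iterdir():
        if item.is_file():
            shutil.copy2(item, compute_data_folder / item.name)

    # 2. execute: the command works only on the compute data folder
    command(compute_data_folder)

    # 3. put: copy files from the compute data folder to the catalog
    for item in compute_data_folder.iterdir():
        if item.is_file():
            shutil.copy2(item, catalog_dir / item.name)
```

The command only ever sees the compute data folder, which is why that folder needs to be configurable per step.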
You can override the globally defined compute data folder for individual steps by providing it in the step configuration.
For example:
catalog:
  ...
dag:
  steps:
    step name:
      ...
      catalog:
        compute_data_folder: # optional and only applies to this step
        get:
          - list
        put:
          - list
    ...
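One way to picture the override is as a lookup that prefers the step-level value and falls back to the global one. The sketch below assumes the configuration has been loaded into plain dictionaries shaped like the YAML above; `resolve_compute_data_folder` is a hypothetical helper, not part of magnus.

```python
def resolve_compute_data_folder(global_catalog: dict, step: dict) -> str:
    """Return the compute data folder that applies to a single step.

    global_catalog: the top-level `catalog` mapping (with its `config`).
    step: the mapping for one step, which may carry its own `catalog`.
    Both shapes are assumptions based on the YAML layout shown above.
    """
    # A step-level value, when present, wins over the global one.
    step_value = (step.get("catalog") or {}).get("compute_data_folder")
    if step_value:
        return step_value

    # Otherwise use the global setting, which defaults to data/.
    config = global_catalog.get("config") or {}
    return config.get("compute_data_folder", "data/")
```

For example, a step whose `catalog` sets `compute_data_folder: scratch/` would resolve to `scratch/`, while a step without a `catalog` section would use the global folder.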
s3_bucket:¶
The S3 bucket to use as the catalog.
prefix:¶
The prefix to the path where the cataloging is done.
For example, if the prefix is catalog, then the catalog for each run would be stored at <s3_bucket>/catalog/<run_id>/.
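The layout can be expressed as a small path-building function. This is a sketch of the documented `<s3_bucket>/<prefix>/<run_id>/` pattern only; `catalog_location` is a hypothetical name, and the handling of an empty prefix is an assumption.

```python
def catalog_location(s3_bucket: str, prefix: str, run_id: str) -> str:
    """Build the per-run catalog location <s3_bucket>/<prefix>/<run_id>/."""
    # Drop surrounding slashes so "catalog" and "/catalog/" behave alike.
    parts = [s3_bucket, prefix.strip("/"), run_id]
    # Assumption: an empty prefix is skipped rather than leaving "//".
    return "/".join(part for part in parts if part) + "/"


print(catalog_location("my-bucket", "catalog", "run-001"))
# prints: my-bucket/catalog/run-001/
```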
aws_profile:¶
Defaults to '' or the default profile.
use_credentials:¶
Defaults to False. It is always safer to use RBAC instead of credentials.
region:¶
Defaults to eu-west-1. The AWS region in which the boto3 session is instantiated.
aws_credentials_file:¶
Defaults to str(Path.home() / '.aws' / 'credentials'). The file where AWS credentials are typically stored. This file is used both when use_credentials is set and internally by boto3 when looking for profiles.
aws_access_key_name:¶
Defaults to 'AWS_ACCESS_KEY_ID'. The name of the environment variable that is used as the AWS access key, if you are using credentials.
aws_secret_access_key_name:¶
Defaults to 'AWS_SECRET_ACCESS_KEY'. The name of the environment variable that is used as the AWS secret access key, if you are using credentials.
aws_session_key_name:¶
Defaults to 'AWS_SESSION_TOKEN'. The name of the environment variable that is used as the AWS session token, if you are using credentials.
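The three `*_name` settings hold names of environment variables, not the secrets themselves. A minimal sketch of how such names could be resolved into the keyword arguments that `boto3.Session` accepts is shown below; `credentials_from_env` is a hypothetical helper, and how magnus performs this lookup internally is not stated here.

```python
import os
from typing import Mapping, Optional


def credentials_from_env(
    aws_access_key_name: str = "AWS_ACCESS_KEY_ID",
    aws_secret_access_key_name: str = "AWS_SECRET_ACCESS_KEY",
    aws_session_key_name: str = "AWS_SESSION_TOKEN",
    environ: Optional[Mapping[str, str]] = None,
) -> dict:
    """Read credentials from the environment using the configured names.

    The returned keys match the keyword arguments of boto3.Session.
    """
    env = os.environ if environ is None else environ

    # The access key and secret key are required when using credentials.
    credentials = {
        "aws_access_key_id": env[aws_access_key_name],
        "aws_secret_access_key": env[aws_secret_access_key_name],
    }

    # Assumption: the session token is optional and only set when present.
    token = env.get(aws_session_key_name)
    if token:
        credentials["aws_session_token"] = token
    return credentials
```

The result could then be passed along as `boto3.Session(region_name="eu-west-1", **credentials)`.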
role_arn:¶
Defaults to ''. The role to assume if you are using sessions.
session_duration_in_seconds:¶
Defaults to 900. The duration of the AWS session, in seconds.
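To show how role_arn and session_duration_in_seconds could map onto an STS call, here is a hedged sketch using boto3's `assume_role`. The helper names and the session name `magnus-catalog` are hypothetical, and this is not necessarily how magnus creates its sessions. The default of 900 seconds coincides with the minimum duration that STS accepts.

```python
def assume_role_kwargs(
    role_arn: str,
    session_duration_in_seconds: int = 900,
    session_name: str = "magnus-catalog",  # hypothetical session name
) -> dict:
    """Build the keyword arguments for the STS assume_role call."""
    if not role_arn:
        raise ValueError("role_arn must be provided to assume a role")
    if session_duration_in_seconds < 900:
        # STS rejects durations shorter than 900 seconds (15 minutes).
        raise ValueError("session_duration_in_seconds must be at least 900")
    return {
        "RoleArn": role_arn,
        "RoleSessionName": session_name,
        "DurationSeconds": session_duration_in_seconds,
    }


def assume_role(role_arn: str, region: str = "eu-west-1", **kwargs) -> dict:
    """Assume the role and return the temporary credentials from STS."""
    import boto3  # imported lazily so the helper above works without boto3

    sts = boto3.client("sts", region_name=region)
    response = sts.assume_role(**assume_role_kwargs(role_arn, **kwargs))
    # Contains AccessKeyId, SecretAccessKey, SessionToken and Expiration.
    return response["Credentials"]
```

Calling `assume_role` needs valid AWS access, while `assume_role_kwargs` can be checked offline.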