Nextflow & Wave CLI need to automatically retrieve and save the Wave build and scan logs to S3 bucket

acknowledged

Charcoal Mandrill

Our internal enterprise requirements mandate that all containers running in our infrastructure have their build logs and security scan logs saved. When using Wave + Fusion, our internal ECR-hosted containers (which already have build and scan logs saved) get replaced with Wave containers.

Wave containers have build and scan logs available, but they are hosted externally on the Wave server. To retrieve them, you must know the Wave container build IDs and API endpoints in advance and make a 'curl' request, for example:

curl https://wave.seqera.io/v1alpha1/scans/<scan_id>

Then the user must manually save the logs somewhere.
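For illustration, the manual retrieve-and-save step described above could be scripted roughly as follows. This is only a minimal sketch: the scan endpoint pattern and the bucket path come from this post, while the function names, the WAVE_ENDPOINT and LOG_PREFIX variables, and the 'scans/<scan_id>.json' object layout are assumptions made up for the example. It also assumes that curl and the AWS CLI are installed and that the AWS CLI already has credentials for the bucket.

```shell
#!/usr/bin/env bash
# Sketch only: fetch a Wave scan record and copy it to S3.
# WAVE_ENDPOINT, LOG_PREFIX and all function names are illustrative assumptions.

WAVE_ENDPOINT="${WAVE_ENDPOINT:-https://wave.seqera.io}"
LOG_PREFIX="${LOG_PREFIX:-s3://my-bucket/systems/ecr/wave-logs}"

# Build the scan URL for a given scan ID (endpoint pattern taken from the post).
scan_url() {
  printf '%s/v1alpha1/scans/%s\n' "${WAVE_ENDPOINT%/}" "$1"
}

# Build the destination object path for a given scan ID (hypothetical layout).
scan_dest() {
  printf '%s/scans/%s.json\n' "${LOG_PREFIX%/}" "$1"
}

# Download one scan record to a temp file, then upload it only if the
# download succeeded, so a failed request never leaves a partial object in S3.
save_scan() {
  local tmp
  tmp="$(mktemp)" || return 1
  curl -fsS -o "$tmp" "$(scan_url "$1")" && aws s3 cp "$tmp" "$(scan_dest "$1")"
  local rc=$?
  rm -f "$tmp"
  return "$rc"
}

# Example: save_scan 63b461e8b024
```

The caller still has to know each scan ID up front, which is exactly the gap this request is about.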

Instead of writing our own custom code to try to automatically retrieve and store all the build and scan logs for all Wave containers used by Nextflow when running with Wave + Fusion, it would be much easier if Nextflow itself could just save them via some configurations.

Consider, for example, the existing configuration for storing Wave containers in our ECR:

https://www.nextflow.io/docs/latest/wave.html

wave.build.repository = '123456789.dkr.ecr.eu-west-1.amazonaws.com/wave/build'
wave.build.cacheRepository = '123456789.dkr.ecr.eu-west-1.amazonaws.com/wave/cache'

It seems like we should be able to use the same approach to store the Wave logs as well, like this:

wave.logs.buildLog = 's3://my-bucket/systems/ecr/wave-logs'
wave.logs.scanLog = 's3://my-bucket/systems/ecr/wave-logs'

Also note that the current Wave CLI tool has a similar capability:

wave --freeze -i ubuntu:22.04 --build-repo 123456789.dkr.ecr.us-east-1.amazonaws.com/wave/build

so I would hope this feature could extend to the Wave CLI as well, with something like:

wave --freeze -i ubuntu:22.04 --build-repo 123456789.dkr.ecr.us-east-1.amazonaws.com/wave/build --build-log s3://my-bucket/systems/ecr/wave-logs --scan-log s3://my-bucket/systems/ecr/wave-logs

Also related: https://github.com/seqeralabs/wave/issues/608

September 18, 2024

Rob Newman marked this post as acknowledged

Fellow Bee

What is not convincing about the proposal to save build and scan logs to an S3 bucket:

it needs customer credentials to access the bucket
you will end up with a "blob" of files identified by a build/scan code, and ultimately the same problem of associating a workflow run with the corresponding container builds/scans
we don't store the raw scan log; instead, we track the vulnerabilities for each container image

However, I understand your point of view. What if we made Wave aware of the workflow ID, and made it possible to access container metadata through it (and through Platform as well)?

Charcoal Mandrill

Fellow Bee For the customer credentials, both the Nextflow and Wave CLI methods are already able to (somehow) leverage some credentials (which?) in order to push copies of the container to our internal AWS ECR. It's my hope that this same method could also be used to give access to the S3 bucket.

The association of the workflow run with the corresponding container build / scan logs is definitely something that would need to be figured out. On the Nextflow side, it seems like something that 'nf-prov' might be able to help with? Not sure if that is feasible or not. However, as it stands, there is currently no way to associate the Nextflow run itself with the containers left in the 'wave/build' and 'wave/cache' repos in our ECR either (please let me know if this is not the case). So I am not sure whether the lack of this association between a workflow run and the logs + containers in S3 + ECR would be a new issue or just the perpetuation of the status quo. Possibly worth a different solution.

For the scan logs, I think whatever is currently available from the API is likely sufficient. There's an example here:

curl https://wave.seqera.io/v1alpha1/scans/63b461e8b024 | jq

this looks fine to me

Charcoal Mandrill

Fellow Bee
(cont.)
> What if we made wave aware of the workflow id, and make it possible to access containers metadata through it (and platform as well)?
I am not sure how many changes, if any, need to be made to the back-end 'wave' service (https://github.com/seqeralabs/wave).
It seems more useful to me if Nextflow (https://www.nextflow.io/docs/latest/wave.html#wave-containers) and the Wave CLI (https://github.com/seqeralabs/wave-cli) were both able to do the log retrievals automatically.
I mention Nextflow and the Wave CLI in particular because we still need to support users who will be running pipelines from the CLI and not from within Seqera Platform. Both of these are already in use by us and by users outside of Platform, and technically our Infosec requirements do extend to the usage of containers on our servers and ParallelCluster regardless of whether Seqera Platform is involved. So it seems like a better target for these two systems to bundle this functionality themselves. Also worth considering: if the end user is required to go out of their way to perform some extra manual actions to retrieve these logs, they are likely to simply forget or otherwise neglect to do it, which would also put us in a more difficult situation.
thanks! :)

Fellow Bee

Charcoal Mandrill thinking more about this, I believe the proper way to collect and report this information is via Platform, both for Platform-based execution and for Nextflow CLI runs (using the -with-tower option).

Ultimately, Platform has been designed to allow users to monitor and track execution metadata for both use cases. Would that work in your case?

Charcoal Mandrill

Fellow Bee that is a very interesting idea. I don't think I have ever tried using the -with-tower option in our current enterprise deployment context. I am gonna put this on our to-do list to explore. Are there perhaps any details or docs available on how exactly this option works with the current versions of enterprise Seqera Platform? I think the last time I used -with-tower was in 2018-2019 on public or other custom on-prem deployments (non-enterprise). In particular, I am not really clear on how this is going to work with our user and workspace configurations & permissions, and what is actually going to happen on the Platform side with this option enabled.

Also, I am guessing there may be a way to enable this from within the Nextflow config file too? That way we might be able to just embed it in a config we could provide to users.

Fellow Bee

Charcoal Mandrill both the Nextflow and Tower (Seqera Platform) CLIs use the same env variables:

export TOWER_ACCESS_TOKEN=<your token>
export TOWER_WORKSPACE_ID=<the target workspace id (optional)>
export TOWER_API_ENDPOINT=<your server api endpoint>
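
On the question of enabling this from a config file rather than the command line: a rough sketch, assuming the 'tower.enabled' config option behaves like -with-tower and that the token and endpoint are still picked up from the environment variables above. The file name 'tower.config' and the helper function name are made up for the example, and the pipeline name is left as a placeholder.

```shell
#!/usr/bin/env bash
# Sketch only: generate a Nextflow config fragment that turns on Platform reporting.
# The file name and function name are illustrative assumptions.

write_tower_config() {
  local out="${1:-tower.config}"
  cat > "$out" <<'EOF'
// Intended to have the same effect as passing -with-tower on the command line.
// The access token and API endpoint are still read from the
// TOWER_ACCESS_TOKEN and TOWER_API_ENDPOINT environment variables.
tower {
    enabled = true
}
EOF
}

# Example usage (pipeline name is a placeholder):
#   write_tower_config tower.config
#   nextflow run <pipeline> -c tower.config
```

Keeping the token out of the generated file means the same config can be handed to every user, with each user exporting their own credentials.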