Nextflow & Wave CLI need to automatically retrieve and save the Wave build and scan logs to an S3 bucket
acknowledged
Flamingo pink Python
Our internal enterprise requirements mandate that all containers running in our infrastructure have their build logs and security scan logs saved. When using Wave + Fusion, our internal ECR-hosted containers (which already have build and scan logs saved) get replaced with Wave containers.
Wave containers have build and scan logs available, but they are hosted externally on the Wave server. To retrieve them, you need a priori knowledge of the Wave container build IDs and the API endpoints, and must issue a 'curl' request such as:
curl https://wave.seqera.io/v1alpha1/scans/<scan_id>
The user must then manually save the logs somewhere. Instead of writing our own custom code to automatically retrieve and store the build and scan logs for every Wave container used by Nextflow when running with Wave + Fusion, it would be much easier if Nextflow itself could save them via configuration.
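A minimal Python sketch of the custom retrieval code this would otherwise require, for one scan record. It only uses the scan endpoint quoted above; the function names, the local output layout, and the assumption that the endpoint returns JSON are illustrative and not part of Wave or Nextflow.

```python
import json
import urllib.request
from pathlib import Path

# Base URL and path are taken from the curl example in this post.
WAVE_ENDPOINT = "https://wave.seqera.io"


def scan_url(scan_id: str, endpoint: str = WAVE_ENDPOINT) -> str:
    """Build the scan-record URL for a given Wave scan ID."""
    return f"{endpoint.rstrip('/')}/v1alpha1/scans/{scan_id}"


def save_scan_record(scan_id: str, out_dir: str, endpoint: str = WAVE_ENDPOINT) -> Path:
    """Download one scan record and write it to <out_dir>/<scan_id>.json.

    Assumes the endpoint responds with a JSON document.
    """
    with urllib.request.urlopen(scan_url(scan_id, endpoint), timeout=30) as resp:
        record = json.load(resp)
    target = Path(out_dir) / f"{scan_id}.json"
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(json.dumps(record, indent=2))
    return target
```

The saved file would still have to be copied to the bucket separately (for example with `aws s3 cp`), and the scan ID still has to be known up front, which is exactly the manual step this request is trying to avoid.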
Consider, for example, the existing configuration for storing Wave containers in our ECR:
wave.build.repository = '123456789.dkr.ecr.eu-west-1.amazonaws.com/wave/build'
wave.build.cacheRepository = '123456789.dkr.ecr.eu-west-1.amazonaws.com/wave/cache'
It seems like we should be able to use the same approach to store the Wave logs as well, with proposed options like these:
wave.logs.buildLog = 's3://my-bucket/systems/ecr/wave-logs'
wave.logs.scanLog = 's3://my-bucket/systems/ecr/wave-logs'
Also note that the current Wave CLI has a similar capability:
wave --freeze -i ubuntu:22.04 --build-repo 123456789.dkr.ecr.us-east-1.amazonaws.com/wave/build
so I would hope this feature could extend to the CLI as well, with something like:
wave --freeze -i ubuntu:22.04 --build-repo 123456789.dkr.ecr.us-east-1.amazonaws.com/wave/build --build-log s3://my-bucket/systems/ecr/wave-logs --scan-log s3://my-bucket/systems/ecr/wave-logs
Also related: https://github.com/seqeralabs/wave/issues/608
Rob Newman
acknowledged
Fellow Bee
What is not convincing in the proposal to save build and scan logs to an S3 bucket is:
- it requires customer credentials to access the bucket
- you will end up with a "blob" of files identified by a build/scan code, and ultimately the same problem of associating a workflow run with the corresponding container builds/scans
- we don't store the raw scan log; instead, we track the vulnerabilities for each container image
However, I understand your point of view. What if we made Wave aware of the workflow ID, and made it possible to access container metadata through it (and through Platform as well)?
Flamingo pink Python
Fellow Bee For the customer credentials, both the Nextflow and Wave CLI methods are already able to (somehow) leverage some credentials (which?) to push copies of the container to our internal AWS ECR. It's my hope that the same mechanism could also be used to give access to the S3 bucket.
The association of the workflow run with the corresponding container build/scan logs is definitely something that would need to be figured out. On the Nextflow side, it seems like something that 'nf-prov' might be able to help with, though I am not sure whether that is feasible. However, as it stands, there is currently no way to associate the Nextflow run itself with the containers left in the 'wave/build' and 'wave/cache' repos in our ECR either (please let me know if this is not the case). So I am not sure whether the lack of this association between a workflow run and the logs and containers in S3 and ECR would be a new issue or just a perpetuation of the status quo. It is possibly worth a separate solution.
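One way the association could be kept, sketched in Python under stated assumptions: if whatever writes the logs also knows a run identifier, it could place each log under a per-run prefix in the bucket. The key layout, the function name `log_key`, and the `run_id` parameter are all hypothetical; nothing here is an existing Nextflow or Wave feature.

```python
from urllib.parse import urlparse


def log_key(base_uri: str, run_id: str, kind: str, item_id: str) -> str:
    """Return an S3 URI of the form <base>/<run_id>/<kind>/<item_id>.json.

    `run_id` stands for whatever identifier the Nextflow run exposes
    (hypothetical), and `kind` is either "build" or "scan".
    """
    if kind not in ("build", "scan"):
        raise ValueError(f"unknown log kind: {kind!r}")
    parsed = urlparse(base_uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"expected an s3:// URI, got: {base_uri!r}")
    prefix = parsed.path.strip("/")
    # Drop empty segments so a bare bucket URI does not produce "//".
    parts = [p for p in (prefix, run_id, kind, f"{item_id}.json") if p]
    return f"s3://{parsed.netloc}/" + "/".join(parts)
```

With a layout like this, listing a single run's prefix would return every build and scan record tied to that run, instead of one flat set of files keyed only by build/scan ID.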
For the scan logs, I think whatever is currently available from the API is likely sufficient. Here is an example:
curl https://wave.seqera.io/v1alpha1/scans/63b461e8b024 | jq
This looks fine to me.
Flamingo pink Python
Fellow Bee
(cont.)
> What if we made Wave aware of the workflow ID, and made it possible to access container metadata through it (and through Platform as well)?
I am not sure how many changes, if any, would need to be made to the back-end 'wave' service (https://github.com/seqeralabs/wave).
It seems more useful to me if Nextflow (https://www.nextflow.io/docs/latest/wave.html#wave-containers) and the Wave CLI (https://github.com/seqeralabs/wave-cli) were both able to retrieve the logs automatically.
I mention Nextflow and the Wave CLI in particular because we still need to support users who run pipelines from the command line and not from within Seqera Platform. Both tools are already in use by us and by users outside of Platform, and technically our Infosec requirements extend to the usage of containers on our servers and ParallelCluster regardless of whether Seqera Platform is involved. So these two tools seem like the better place to bundle this functionality. It is also worth considering that if end users are required to go out of their way to retrieve these logs manually, they are likely to forget or otherwise neglect to do it, which would put us in a more difficult situation.
thanks! :)
Fellow Bee
Flamingo pink Python thinking more about this, I believe the proper way to collect and report this information is via Platform, both for Platform-based execution and for Nextflow CLI runs (using the -with-tower option).
Ultimately, Platform has been designed to allow users to monitor and track execution metadata for both use cases. Would that work in your case?
Flamingo pink Python
Fellow Bee that is a very interesting idea. I don't think I have ever tried using the -with-tower option in our current enterprise deployment context. I am going to put this on our to-do list to explore. Are there any details or docs available on how exactly this option works with current versions of enterprise Seqera Platform? I think the last time I used -with-tower was in 2018-2019, on public or other custom on-prem deployments (non-enterprise). In particular, I am not clear on how this will work with our user and workspace configurations and permissions, and what actually happens on the Platform side with this option enabled.
Also, I am guessing there may be a way to enable this from within the Nextflow config file too, so that we could embed it in a config we provide to users?
Fellow Bee
Flamingo pink Python both Nextflow and the Tower (Seqera Platform) CLI use the same environment variables:
export TOWER_ACCESS_TOKEN=<your token>
export TOWER_WORKSPACE_ID=<the target workspace id (optional)>
export TOWER_API_ENDPOINT=<your server api endpoint>
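To reduce the chance of users forgetting the option, these variables and the -with-tower flag could be wrapped in a small launcher. The Python sketch below is only an illustration: the function names are hypothetical, and the workspace variable is written as TOWER_WORKSPACE_ID, which is the name given in the Nextflow documentation.

```python
import os
import subprocess
from typing import Dict, List, Optional


def tower_env(token: str, endpoint: str, workspace_id: Optional[str] = None) -> Dict[str, str]:
    """Return a copy of the current environment with the Tower variables set."""
    env = dict(os.environ)
    env["TOWER_ACCESS_TOKEN"] = token
    env["TOWER_API_ENDPOINT"] = endpoint
    if workspace_id:  # the workspace is optional
        env["TOWER_WORKSPACE_ID"] = workspace_id
    return env


def nextflow_command(pipeline: str, extra_args: Optional[List[str]] = None) -> List[str]:
    """Build a `nextflow run` command that always includes -with-tower."""
    return ["nextflow", "run", pipeline, "-with-tower", *(extra_args or [])]


def run_pipeline(pipeline: str, token: str, endpoint: str,
                 workspace_id: Optional[str] = None) -> int:
    """Launch the pipeline and return Nextflow's exit code."""
    result = subprocess.run(
        nextflow_command(pipeline),
        env=tower_env(token, endpoint, workspace_id),
    )
    return result.returncode
```

The same settings can also be placed in the Nextflow config under the `tower` scope (`tower.enabled`, `tower.accessToken`, `tower.workspaceId`, `tower.endpoint`), which may be the simpler route for a shared config handed to users.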