Let's say I want to do (small-scale) distro-like work:
- I want to look at a bunch of content online which comes from various authors;
- I want to have a reasonably automatic process that can look at their stuff and ingest it into the warpforge system;
- I expect the beginning phases of that ingesting will be a bit ugly and very unreproducible — it has to look at the internet by definition.
- Ultimately I want plot files to come out.
- This should be as tersely achievable as possible, because we're going to be doing this a lot.
How big is this?
This is a very big, very end-to-end story.
It talks about what we want from L3+ tools.
It implies a lot of ingester tooling (though we're trying to just make an appropriately sized box for that).
It will take us a while to get to a reasonably complete and ready story for this.
It's not super clear anyone has ever done this particularly well, especially within our constraints and goals: clear checkpoints for snapshotting, and an explicit transition point past which things become reproducible.
Walking through it
Phase 1
- I have some sort of script that scrapes some corner of the internet for projects or repos or packages or whathaveyou.
- This emits a list of information about projects. Probably mostly git repo URLs? Maybe other basic info? Not much that's detailed though.
- Git repo URLs are probably the easiest kind of source to store.
- Sometimes there might be other easy things like tar archives.
- If it's something else that needs custom downloading and management... Well, that's less fun, but then we just have to come up with a way to pass a description of that on to the next phase. (We do not need to solve this in a general way! Duct tape and kludges are fine here!)
- Checkpoint: this is either a ware, or more likely, we actually unroll it into files and put them in a VCS. (A human intervention and review cycle opportunity is probably desirable here.)
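The phase-1 output and checkpoint step could be sketched roughly like this. Everything here is a hypothetical convention for illustration (the record fields `name`, `git`, `tarball`, `notes`, and the one-file-per-project layout); none of it is prescribed by warpforge:

```python
import json
from pathlib import Path

def normalize_entry(name, git_url=None, tarball_url=None, notes=""):
    """A phase-1 record: just enough info for the next phase to fetch the source.

    Field names here are made up for this sketch; the real schema is TBD.
    """
    if not (git_url or tarball_url):
        raise ValueError(f"{name}: need at least one fetchable source")
    return {"name": name, "git": git_url, "tarball": tarball_url, "notes": notes}

def checkpoint(entries, outdir):
    """Unroll the scraped list into one JSON file per project, ready to commit to a VCS."""
    outdir = Path(outdir)
    outdir.mkdir(parents=True, exist_ok=True)
    for entry in entries:
        path = outdir / (entry["name"] + ".json")
        path.write_text(json.dumps(entry, indent=2, sort_keys=True) + "\n")
```

Writing one small file per project, rather than one big blob, is what makes the human review cycle pleasant: each new or changed project shows up as its own diff in the VCS.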
Phase 2
- We probably want to clone each of the slurped projects... into a container. We've got all this good infra for this; let's use it. Stamping out the upstream repo into a formula input is actually legitimately the easiest way to proceed!
- There are probably some relatively complex tools we also put into this container. For example: something that looks for a `go.mod` file (or better yet, a "lock" equivalent — but that's more detail than we need for this user story overall), and then deduces that into warpforge-style WareIDs... Or catalog names. Or something.
- What exactly the result here is still needs defining.
- This container will probably also have network. We have not yet crossed the Rubicon where we expect things to become reproducible, or even replayable.
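To make the "deduces that into..." step concrete, here is a sketch of the kind of tool meant: a tiny `go.mod` scanner that pulls out module requirements and maps them to a made-up catalog-name convention. The parsing is deliberately naive (it ignores `replace` and `exclude` directives entirely), and the `module:version` naming is purely illustrative; the real mapping to WareIDs or catalog names is exactly the part flagged above as needing definition.

```python
import re

# Matches lines like "github.com/foo/bar v1.2.3" inside a require block.
REQUIRE_LINE = re.compile(r"^\s*([\w./-]+)\s+v([\w.+-]+)")

def parse_gomod(text):
    """Yield (module, version) pairs from the require directives of a go.mod file."""
    deps = []
    in_require = False
    for line in text.splitlines():
        line = line.split("//")[0].strip()  # drop comments like "// indirect"
        if line.startswith("require ("):
            in_require = True
            continue
        if in_require and line == ")":
            in_require = False
            continue
        if line.startswith("require "):
            m = REQUIRE_LINE.match(line[len("require "):])
        elif in_require:
            m = REQUIRE_LINE.match(line)
        else:
            continue
        if m:
            deps.append((m.group(1), "v" + m.group(2)))
    return deps

def to_catalog_name(module, version):
    """One plausible convention: module path as catalog name, version as release tag.

    This naming scheme is invented for this sketch, not a warpforge standard.
    """
    return f"{module}:{version}"
```

A lock-equivalent file (`go.sum`) would be the better input for pinning, since it carries hashes; this sketch only shows the shape of the deduction, not the final answer.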