Honestly I think the hardest part is identifying it and locating it.
Probably need to start a community around it. People link to stuff. Paywalls would have to be dealt with. In the vast majority of cases that’s not too difficult. Now, it does start getting less available around video sources. IP restrictions will be a pain in the ass.
Recording the video is messy. yt-dlp could work for some things, but honestly with the level of what’s coming I’d be afraid of leaving fingerprints on anything. Probably throwing a full-screen player up in a 1080p window and using OBS on it would be the safest. Could probably get away with using an Elgato to capture the HDMI signal up to 4K.
Speech-to-text models are light enough to run on a Raspberry Pi. They’ll need to be vetted, since they’re not highly accurate. Captioning is a great community task.
Organizing and indexing the captions: there’s no shortage of free database software. I’d probably start with SQLite to keep things portable and fluid, moving to MariaDB or Postgres when things get too slow. But then we’re going to have to host it. Anonymous hosting is a completely different ball of wax.
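For what it’s worth, here’s a rough sketch of what a first-pass caption index could look like using Python’s built-in `sqlite3` module. The table layout, the column names, and the `open_index`, `add_caption`, and `search` helpers are all made up for illustration, not a worked-out design. It uses a plain `LIKE` search to stay dependency-free; SQLite’s FTS5 extension would likely be the better fit once there’s real volume, assuming the build has it enabled.

```python
import sqlite3


def open_index(path=":memory:"):
    """Open (or create) a caption index. The schema is a hypothetical example."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS captions (
               id INTEGER PRIMARY KEY,
               video_id TEXT NOT NULL,
               start_seconds REAL NOT NULL,
               text TEXT NOT NULL
           )"""
    )
    # Index for pulling the captions of one video back in time order.
    conn.execute(
        "CREATE INDEX IF NOT EXISTS idx_captions_video "
        "ON captions (video_id, start_seconds)"
    )
    return conn


def add_caption(conn, video_id, start_seconds, text):
    """Store one caption line with its timestamp."""
    conn.execute(
        "INSERT INTO captions (video_id, start_seconds, text) VALUES (?, ?, ?)",
        (video_id, start_seconds, text),
    )
    conn.commit()


def search(conn, phrase):
    """Substring search over caption text (case-insensitive for ASCII)."""
    rows = conn.execute(
        "SELECT video_id, start_seconds, text FROM captions "
        "WHERE text LIKE ? ORDER BY video_id, start_seconds",
        (f"%{phrase}%",),
    )
    return rows.fetchall()
```

Since it’s a single file on disk (pass a real path instead of `":memory:"`), it can be copied around on a thumb drive, which is the whole point of starting here rather than on a server.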
Storing the data would get out of hand quickly. It’s trivial enough to buy a single 20 TB hard drive and store more than we’d need for years. But then hosting it anonymously would be difficult, to say the least. Even the markers of these conversations would be traceable enough for us to be located. Paying for a private enough node to be safe is going to be pricey over time.
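A quick back-of-envelope check on the 20 TB figure. The 5 Mbps bitrate is my own guess at a reasonable 1080p encode, not a measured number, and `hours_of_video` is just a throwaway helper.

```python
def hours_of_video(drive_bytes, bitrate_mbps):
    """Hours of video that fit on a drive at a constant bitrate (megabits/s)."""
    bytes_per_hour = bitrate_mbps * 1_000_000 / 8 * 3600
    return drive_bytes / bytes_per_hour


DRIVE_BYTES = 20 * 10**12  # 20 TB as drive makers count it (decimal)

# Assumed bitrate: 5 Mbps for 1080p. Change it to match the real encodes.
print(round(hours_of_video(DRIVE_BYTES, 5)))  # prints 8889
```

So roughly 8,900 hours at that bitrate, or about a year of nonstop footage. Higher bitrates or 4K captures would eat into that fast.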
I’m sure archive.org would take it, but honestly I wouldn’t put $5 on them surviving a couple of years into the new administration. What they’re doing is too inconvenient to too many corporations with deep pockets.
IPFS would work, well, about as well as it works anyway, but that’s the opposite of anonymous.
Edit: come to think of it, it would be a hoot to run it on the short video federated platform. Just keep the database somewhere else. Again, not anonymous enough for my tastes, but it would add a little bit of fun to the project.
Pshaw, that’s how I had to do it. Slackware on floppy. Pre-internet search engine, one computer per household. No cellular data.
Windows -> dial-up -> look at some docs, take notes -> reboot into Slackware -> mess with the console -> get stuck -> reboot into Windows -> repeat