How to merge multiple videos into one, with text overlay (timed according to voice audio)

Hi everyone, I'll do my best to explain. I have a database of real estate properties. Each one has videos of the house and data like the number of rooms, bathrooms, etc.

I'll feed all that house data into ChatGPT so that it can create a video script, and then I'll turn that script into voice with ElevenLabs.

After that I want to merge all those video shots of the house into one. For example, let's say the scripted voice talks about the kitchen for the first 5 seconds: the kitchen video, plus a text overlay saying "kitchen", would appear during that part.

Then for the next video shot the process would repeat, and so on. The final result would be like a slideshow video showing each part of the house, with the appropriate text overlay on each part and the ElevenLabs voice on top (timed accordingly, of course).
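
To make the timing idea concrete, here's a rough Python sketch of what I have in mind. It assumes I already know how many seconds the voice spends on each part of the house; those durations, the file names, and the function names are all made up for illustration. It also builds an ffmpeg `drawtext` filter string that uses `enable='between(t,start,end)'` to show each label only during its part. That's just one possible way to time the overlays, not necessarily how any particular tool does it.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    label: str    # overlay text, e.g. "Kitchen"
    clip: str     # hypothetical path to the video shot for this part
    start: float  # seconds from the start of the voice track
    end: float    # seconds from the start of the voice track


def build_timeline(sections):
    """Turn (label, clip_path, spoken_seconds) tuples, in script order,
    into segments with absolute start/end times."""
    timeline = []
    cursor = 0.0
    for label, clip, seconds in sections:
        if seconds <= 0:
            raise ValueError(f"duration for {label!r} must be positive")
        timeline.append(Segment(label, clip, cursor, cursor + seconds))
        cursor += seconds
    return timeline


def drawtext_filters(timeline):
    """Build a comma-separated chain of ffmpeg drawtext filters,
    one per segment, each enabled only during its time window.
    Position, size, and color values here are arbitrary placeholders."""
    parts = [
        f"drawtext=text='{s.label}':x=40:y=40:fontsize=48:fontcolor=white:"
        f"enable='between(t,{s.start:g},{s.end:g})'"
        for s in timeline
    ]
    return ",".join(parts)


if __name__ == "__main__":
    # Assumed durations: the voice spends 5 s on the kitchen, 4 s on the bedroom.
    timeline = build_timeline([
        ("Kitchen", "kitchen.mp4", 5.0),
        ("Bedroom", "bedroom.mp4", 4.0),
    ])
    for seg in timeline:
        print(seg)
    print(drawtext_filters(timeline))
```

Each clip would still need to be trimmed or stretched to its segment length and concatenated before the overlay and the voice track are added; the sketch only covers the timing part.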

Is the No-Code Architects Toolkit capable of doing this? I really wanna use it haha. And if not, what tools would you recommend? Thank you! I would appreciate any help.
