Turn yourself into a rapper using AI

Raza

• Feb 13, 2023

• 9 min read

GPT-3 generates text. ElevenLabs generates speech. You'll generate fire.

I'll get straight to it - I built an app that raps on any topic in the style of Eminem using my voice. Check it out:

How? Using an AI-powered text-to-speech (TTS) API. You’ve probably never really thought about building with TTS services. Most of the current ones suck, and the only time you run into them is on livestreams or automated answering systems.

That changes today - these new APIs are so good you’ll be questioning your own voice. Narrate ebooks or articles in your own voice. Create high-quality content voice-overs in a noisy train. Make Steve Jobs sing Hakuna Matata.

AI tools are evolving at breakneck speed. It’s like we’re hitting the singularity in real time - you can generate speech in your own voice in less than 10 minutes 🤯

Here's my app — give it a whirl!

Here’s a TL;DR on how to build it in case you’re in a rush:

Record your voice saying fancy words, upload that recording to ElevenLabs
Use GPT-3 to generate lyrics in the style of any rapper on any topic
Feed the generated lyrics to ElevenLabs to get audio of you reading out those lyrics
Slap these APIs into a Next.js app

Replit template here (faster). GitHub repo here (more features).

Now anyone can make you rap about almost anything.

Tools you need to get started

There are only three tools you need.

Replit OR VS Code + Railway.app (all free)
GPT-3 ($18 in free credits).
ElevenLabs starter - free for the first month but you need a credit card
A glass of water (tap water is ok)
Your voice

Btw, shipping this app sets you up to join 1000s of other builders for a 6-week sprint called Nights & Weekends. You'll get tons of support and feedback to turn any idea into a product or company.

All you gotta do is apply here.

Record and clone your voice

This part may sound like the easiest, but it’s actually pretty tricky! You’ll need a quiet-ish environment for this and the best mic you have. If you don’t have a condenser microphone, the mic on your smartphone is your best bet.

You’ll need to record 3-4 minutes of yourself reading this. It’s a special script made for training voice samples. Use any recording software; I used the Windows voice recorder.

Here are some guidelines on how you should read it:

Stay close to the mic - the audio you’re recording is for algorithms, not humans. Loud is okay.
Start by reading in your neutral voice - just like how you usually talk
After about 30s, slowly transition to a more expressive style of speaking - put emphasis on certain words
After another 30s or so, speak louder - then back to normal

Avoid being monotone and talking super consistently! You want to stretch your voice and give it as much data as possible. Just don’t yell or go too crazy as that will impact your results lol.

Once you’re done, head over to the ElevenLabs voice lab and upload your voice sample. You should be able to use it right then! Try it out by generating a few sentences you usually say. You’ll feel if it sounds like you. If you’re satisfied, go on forth! If not, try recording another sample and create a new clone lol. It took me 3 tries to get something I was happy with, and it still has an extra-American accent.

Generating lyrics

This one’s easy! We’ll be using our old friend GPT-3 for this. Pull up the OpenAI playground and put in a prompt like this:

Give me lyrics for a rap song in the style of <ARTIST> on the following topic. Respond with only the lyrics and nothing else. Topic: <Add topic here>

Replace <ARTIST> and put in a topic. I’m an Eminem fan so this is how mine went:

Pretty cool, eh? If you use specific words, you can tell it to use those in the lyrics when it makes sense! I’ve been told I say “proper” a lot so I told GPT-3 to use it if it sounded good lol.
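If you'd rather build that prompt in code than retype it in the playground, here's a rough sketch. The function name buildPrompt and the favoriteWords parameter are made up for illustration and aren't taken from the repo.

```javascript
// Builds the lyrics prompt from the template above.
// favoriteWords is optional: words you'd like GPT-3 to work in when they fit.
function buildPrompt(artist, topic, favoriteWords = []) {
  const base =
    `Give me lyrics for a rap song in the style of ${artist} on the following topic. ` +
    "Respond with only the lyrics and nothing else.";

  const extras =
    favoriteWords.length > 0
      ? ` Use these words where they fit naturally: ${favoriteWords.join(", ")}.`
      : "";

  return `${base}${extras} Topic: ${topic}`;
}
```

Calling buildPrompt("Eminem", "debugging", ["proper"]) gives you a single string you can paste into the playground or send to the API.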

You can also try this fancy prompt if you want to be very specific about what your lyrics should contain:

I want you to act as a rapper. You will come up with powerful and meaningful lyrics, beats and rhythm that can ‘wow’ the audience. Your lyrics should have an intriguing meaning and message which people can relate to. When it comes to choosing your beat, make sure it is catchy yet relevant to your words, so that when combined they make an explosion of sound every time! My first request is "I need a rap song about finding strength within yourself in the style of DaBaby."

Get your keys

We’ll need three things here - an OpenAI API key, an ElevenLabs API key, and your voice ID.

Get your OpenAI API key here. Make sure your credits haven’t expired.
Click your icon on the top right in ElevenLabs, then click profile and copy the API key.
Head over to the API “docs” for ElevenLabs. This link will take you to the /v1/voices endpoint. Click “Try it out”, paste your API key in, and click execute. This will make an API call using your browser and show you the response. Find your voice in the list of results and copy the ID. We’re good to go!
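If clicking through the docs page feels slow, the same lookup can be scripted. This is only a sketch: it assumes the /v1/voices response is a JSON object with a voices array whose entries have name and voice_id fields, and that the key goes in an xi-api-key header. Check the ElevenLabs docs if the call fails. findVoiceId and fetchVoiceId are hypothetical names.

```javascript
// Picks a voice_id out of a /v1/voices response by (case-insensitive) name.
// Assumed response shape: { voices: [{ name, voice_id, ... }, ...] }
function findVoiceId(voicesResponse, voiceName) {
  const wanted = voiceName.toLowerCase();
  const match = voicesResponse.voices.find(
    (voice) => voice.name.toLowerCase() === wanted
  );
  return match ? match.voice_id : null;
}

// Fetches the voice list and returns the matching ID (or null).
// fetchImpl is injectable so this can be tried without a real network call.
async function fetchVoiceId(apiKey, voiceName, fetchImpl = fetch) {
  const response = await fetchImpl("https://api.elevenlabs.io/v1/voices", {
    headers: { "xi-api-key": apiKey },
  });
  if (!response.ok) {
    throw new Error(`Voice lookup failed with status ${response.status}`);
  }
  return findVoiceId(await response.json(), voiceName);
}
```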

Time to sing

I’ve done all the hard work on this one. All you have to do is clone this repo OR fork this Replit.

If you went with the GitHub repo, make a .env file, and paste the keys in like this:

Smash in npm i and then npm run dev. This will pull your app up on localhost:3000. Drop a topic in there and generate!

If you're on Replit, create 3 secrets and name them as above. Now hit the big green start button at the top. Your app is ready!
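A quick sanity check can catch a missing key before the first request goes out. The variable names below are hypothetical placeholders, not necessarily the ones the repo expects, so swap in whatever names your .env file or Replit secrets actually use.

```javascript
// Hypothetical names: replace with the ones your copy of the app reads.
const REQUIRED_KEYS = [
  "OPENAI_API_KEY",
  "ELEVENLABS_API_KEY",
  "ELEVENLABS_VOICE_ID",
];

// Returns the names of any required values that are missing or blank.
function missingKeys(env = process.env) {
  return REQUIRED_KEYS.filter(
    (name) => typeof env[name] !== "string" || env[name].trim() === ""
  );
}
```

Run missingKeys() at startup and log the result; an empty array means all three values are set.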

It’ll take about 10-15 seconds for the two API calls, and then you should hear your incredible voice rapping!

Don’t like the generated results? Head back to the ElevenLabs API interface and scroll down to the /v1/voices/{voice_id}/settings/edit endpoint, paste in the voice_id, the API key and something like this in the body:

This will adjust the results - lower stability makes it sound more “human” but can result in some wackiness. Values range from 0 to 1; I usually go with 0.2 and 0.3. You can play around with this on the synthesis screen, but I’m updating it via the API because it gets saved that way. If you ever wanna update this programmatically, I have an updateSettings.js endpoint in the utils folder that does this.
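Here's a rough sketch of what such a settings update could look like in code. It is not the repo's updateSettings.js. It assumes the request body uses the fields stability and similarity_boost, that the endpoint takes a POST with an xi-api-key header, and that 0.2 maps to stability and 0.3 to similarity boost (that pairing is a guess). The function names are hypothetical.

```javascript
// Validates both values (0 to 1) and builds the assumed request body.
function buildVoiceSettings(stability, similarityBoost) {
  const values = { stability, similarityBoost };
  for (const [name, value] of Object.entries(values)) {
    if (typeof value !== "number" || Number.isNaN(value) || value < 0 || value > 1) {
      throw new RangeError(`${name} must be a number between 0 and 1, got ${value}`);
    }
  }
  return { stability, similarity_boost: similarityBoost };
}

// Sends the settings to the edit endpoint named above.
// fetchImpl is injectable so the function can be exercised without a network.
async function updateVoiceSettings(voiceId, apiKey, settings, fetchImpl = fetch) {
  const url = `https://api.elevenlabs.io/v1/voices/${voiceId}/settings/edit`;
  const response = await fetchImpl(url, {
    method: "POST",
    headers: { "Content-Type": "application/json", "xi-api-key": apiKey },
    body: JSON.stringify(settings),
  });
  if (!response.ok) {
    throw new Error(`Settings update failed with status ${response.status}`);
  }
  return response.status;
}
```

Validating before sending means a typo like 2 instead of 0.2 fails locally instead of burning an API call.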

How's this all work?

There are only two API calls being made under the hood. The first one generates the lyrics using this prompt and whatever the user puts in:
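For reference, a lyrics call along these lines can be sketched as follows. It assumes OpenAI's completions endpoint and the text-davinci-003 model (a GPT-3 model; swap in whatever your account offers). The max_tokens and temperature values and the function names are placeholders, not what the repo uses.

```javascript
const OPENAI_COMPLETIONS_URL = "https://api.openai.com/v1/completions";

// Builds the fetch options for a completions request.
// The model default and sampling values are assumptions for illustration.
function buildCompletionRequest(prompt, apiKey, model = "text-davinci-003") {
  return {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${apiKey}`,
    },
    body: JSON.stringify({ model, prompt, max_tokens: 400, temperature: 0.8 }),
  };
}

// Sends the prompt and returns the raw generated text, trimmed.
async function generateLyrics(prompt, apiKey, fetchImpl = fetch) {
  const response = await fetchImpl(
    OPENAI_COMPLETIONS_URL,
    buildCompletionRequest(prompt, apiKey)
  );
  if (!response.ok) {
    throw new Error(`OpenAI request failed with status ${response.status}`);
  }
  const data = await response.json();
  return data.choices[0].text.trim();
}
```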

The fun part is the post-processing on this in index.js:

Because OpenAI doesn’t return perfectly formatted stuff, we have to do some clean up. No matter how I changed my prompt, GPT-3 seems to return lyrics with [Verse 1] and [Chorus]. I decided to just remove them via JS lol. We’re also removing any short lines.

Finally, we’re adding punctuation — this is really important because it’s how the ElevenLabs API knows to frame sentences. For example, if the sentence ends in an exclamation mark (!), the TTS API adds emphasis. If there's no punctuation it just sounds really weak.

Not the most elegant solution, but it works!
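The three clean-up steps described above (drop the bracketed section tags, drop short lines, add punctuation) can be sketched like this. It isn't the code from index.js. The 10-character cutoff and the choice of a period as the default punctuation are assumptions, and cleanLyrics is a made-up name.

```javascript
// Cleans raw model output so a TTS API has tidy sentences to read.
function cleanLyrics(raw, minLength = 10) {
  return raw
    .split("\n")
    // 1. Remove section tags such as [Verse 1] or [Chorus].
    .map((line) => line.replace(/\[[^\]]*\]/g, "").trim())
    // 2. Drop empty and very short lines.
    .filter((line) => line.length >= minLength)
    // 3. Make sure every line ends with punctuation.
    .map((line) => (/[.,!?;:]$/.test(line) ? line : `${line}.`))
    .join("\n");
}
```

Lines that already end in punctuation, such as an exclamation mark, are left alone so the emphasis survives.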

The other half of this is the textToSpeech.js endpoint. It's a pretty standard API call. The only unfamiliar bit you'll see is the return:

Instead of returning JSON like you’d usually do, we’re returning an audio/mpeg stream, and we’re using data.pipe(res) to “pipe” it to the front-end. I asked ChatGPT wtf piping means and what a stream is. Here’s what it said:

"Piping data in a stream means to take some data (in this case, audio data) and "flow" it from one place to another, typically for the purpose of processing it in stages. In the code you provided, the audio data stored in the "data" variable is being piped to the "res" variable, which represents the response being sent back to the client. The "pipe" method is a way to chain together multiple operations to be performed on the data as it flows through the stream. The "finish" event is emitted once all data has been processed and the stream is finished."

Nice! That makes enough sense lol
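If you want to see piping on its own, here's a tiny Node sketch that is separate from the app's code. It sends a few chunks from a readable stream into a writable one and collects them, which is the same idea as data.pipe(res) pushing audio chunks into the response. pipeChunks is a made-up helper name.

```javascript
const { Readable, Writable } = require("node:stream");
const { pipeline } = require("node:stream/promises");

// Pipes chunks from a source stream into a sink and returns everything received.
async function pipeChunks(chunks) {
  const received = [];

  // Source: emits each chunk in order (stands in for the audio stream).
  const source = Readable.from(chunks);

  // Sink: stores every chunk it is handed (stands in for the HTTP response).
  const sink = new Writable({
    write(chunk, _encoding, callback) {
      received.push(Buffer.from(chunk));
      callback();
    },
  });

  // pipeline() connects the two and resolves once the sink has finished.
  await pipeline(source, sink);
  return Buffer.concat(received);
}
```

The data arrives at the sink piece by piece, so nothing has to wait for the whole file to exist first.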

Here's how I'm receiving it on the front-end:

Ezpz! Only took me an hour of searching to find these three lines 😂

Play around with the playback rate here. I found 1.2 too much and lowered it to 1.1 later. For other genres, you'll probably wanna remove this!
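For a feel of what receiving an audio stream in the browser can look like, here's a hedged sketch, not the repo's three lines. It assumes the endpoint is served at /api/textToSpeech and accepts a JSON body with a text field; both are guesses. playLyrics and its options are hypothetical names.

```javascript
// Hypothetical route: assumes textToSpeech.js is exposed as a Next.js API route.
const TTS_ENDPOINT = "/api/textToSpeech";

// Requests audio for the lyrics, then plays it slightly sped up.
// fetchImpl and AudioImpl default to the browser globals and can be swapped out.
async function playLyrics(
  lyrics,
  { fetchImpl = fetch, AudioImpl = Audio, rate = 1.1 } = {}
) {
  const response = await fetchImpl(TTS_ENDPOINT, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ text: lyrics }),
  });
  if (!response.ok) {
    throw new Error(`TTS request failed with status ${response.status}`);
  }

  // Turn the streamed bytes into a blob URL the audio element can load.
  const blob = await response.blob();
  const audio = new AudioImpl(URL.createObjectURL(blob));
  audio.playbackRate = rate; // 1.0 is normal speed
  await audio.play();
  return audio;
}
```

Pass a different rate, for example playLyrics(text, { rate: 1 }), to drop the speed-up for other genres.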

Running out of ElevenLabs API credits? Use your browser! The browser has a pretty nifty voice synthesis API built in. It’s wonky and robotic compared to ElevenLabs, but it’s free! It’s pretty simple, check out how I implemented it here.
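A minimal sketch of that browser fallback, using the standard Web Speech API (SpeechSynthesisUtterance and speechSynthesis.speak). It isn't the linked implementation, and speakInBrowser is a made-up name.

```javascript
// Reads text aloud with the browser's built-in voice.
// Returns false when speech synthesis isn't available (e.g. outside a browser).
function speakInBrowser(text, rate = 1.1) {
  if (typeof window === "undefined" || !("speechSynthesis" in window)) {
    return false;
  }
  const utterance = new SpeechSynthesisUtterance(text);
  utterance.rate = rate; // 1 is the default speaking speed
  window.speechSynthesis.speak(utterance);
  return true;
}
```

The return value lets the caller decide what to do, such as showing the lyrics as text, when no speech engine is available.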

WOOOO we’re done! Feel free to play around with this template, you can replace the music with something of your own tastes and maybe even change the genre of music :)

If you did this on Replit, check out the GitHub repo for extra functionality - I got a bit carried away and added daily IP limits, replay functionality, and even "previous songs".

Wtf do I do now?

You’ve just picked up two hot new skills - text generation with GPT-3 and speech generation with what’s probably the world’s best TTS API.

Make new things with these skills!

There are so many cool things you can build with these two. Speech-based tech support. Stand-up comedy generator. Audio summaries of books or articles. ChatGPT powered answering machine.

The only limits are the API limits.

All it takes is one link to make your mark.

Here's the proof.

When you're ready - take it to the next level:

Cya off localhost!

Raza

buildspace