{ "cells": [ { "cell_type": "markdown", "id": "45a1b5d7-fd98-4fa2-9bea-e68c514b9245", "metadata": {}, "source": [ "## Notebook 4: TTS Workflow\n", "\n", "We have the exact podcast transcripts ready now to generate our audio for the Podcast.\n", "\n", "In this notebook, we will learn how to generate Audio using both `suno/bark` and `parler-tts/parler-tts-mini-v1` models first. \n", "\n", "After that, we will use the output from Notebook 3 to generate our complete podcast\n", "\n", "Note: Please feel free to extend this notebook with newer models. The above two were chosen after some tests using a sample prompt." ] }, { "cell_type": "markdown", "id": "534e5f94-66d0-459d-ab01-8599905d8e1b", "metadata": {}, "source": [ "⚠️ Warning: This notebook likes have `transformers` version to be `4.43.3` or earlier so we will downgrade our environment to make sure things run smoothly" ] }, { "cell_type": "markdown", "id": "efd866ac-8ea6-486d-96cd-7594a8c329e0", "metadata": {}, "source": [ "Credit: [This](https://colab.research.google.com/drive/1dWWkZzvu7L9Bunq9zvD-W02RFUXoW-Pd?usp=sharing#scrollTo=68QtoUqPWdLk) Colab was used for starter code\n" ] }, { "cell_type": "markdown", "id": "a4e2c0ee-7527-46e4-9c07-e6dac34376e5", "metadata": {}, "source": [ "We can install these packages for speedups" ] }, { "cell_type": "code", "execution_count": 1, "id": "3ee4811a-50a1-4030-8312-54fccddc221b", "metadata": {}, "outputs": [], "source": [ "#!pip3 install optimum\n", "#!pip install -U flash-attn --no-build-isolation\n", "#!pip install transformers==4.43.3" ] }, { "cell_type": "markdown", "id": "07672295-af30-4b4b-b11c-44ca938436cd", "metadata": {}, "source": [ "Let's import the necessary frameworks" ] }, { "cell_type": "code", "execution_count": 2, "id": "89d75859-e0f9-40e3-931d-64aa3d273f49", "metadata": {}, "outputs": [], "source": [ "from IPython.display import Audio\n", "import IPython.display as ipd\n", "from tqdm import tqdm" ] }, { "cell_type": "code", "execution_count": 3, "id": "f442758d-c48f-48ac-a4b0-558695290aa9", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Flash attention 2 is not installed\n" ] } ], "source": [ "from transformers import BarkModel, AutoProcessor, AutoTokenizer\n", "import torch\n", "import json\n", "import numpy as np\n", "from parler_tts import ParlerTTSForConditionalGeneration" ] }, { "cell_type": "markdown", "id": "31ba1903-59c8-4004-bb39-1761cd3d140e", "metadata": {}, "source": [ "### Testing the Audio Generation" ] }, { "cell_type": "markdown", "id": "2523c565-bb35-4fae-bdcb-cba11ef0b572", "metadata": {}, "source": [ "Let's try generating audio using both the models to understand how they work. \n", "\n", "Note the subtle differences in prompting:\n", "- Parler: Takes in a `description` prompt that can be used to set the speaker profile and generation speeds\n", "- Suno: Takes in expression words like `[sigh]`, `[laughs]` etc. You can find more notes on the experiments that were run for this notebook in the [TTS_Notes.md](./TTS_Notes.md) file to learn more." ] }, { "cell_type": "markdown", "id": "50b62df5-5ea3-4913-832a-da59f7cf8de2", "metadata": {}, "source": [ "Please set `device = \"cuda\"` below if you're using a single GPU node." 
] }, { "cell_type": "markdown", "id": "309d0678-880b-44cb-a54a-9408b3c8d644", "metadata": {}, "source": [ "#### Parler Model\n", "\n", "Let's try using the Parler Model first and generate a short segment with speaker Laura's voice" ] }, { "cell_type": "code", "execution_count": 4, "id": "4e84ed3f-336b-4f45-b098-ce477929fa8a", "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "195613e97fe04656b3241a966c2959aa", "version_major": 2, "version_minor": 0 }, "text/plain": [ "config.json: 0%| | 0.00/6.93k [00:00 is overwritten by shared text_encoder config: T5Config {\n", " \"_name_or_path\": \"google/flan-t5-large\",\n", " \"architectures\": [\n", " \"T5ForConditionalGeneration\"\n", " ],\n", " \"classifier_dropout\": 0.0,\n", " \"d_ff\": 2816,\n", " \"d_kv\": 64,\n", " \"d_model\": 1024,\n", " \"decoder_start_token_id\": 0,\n", " \"dense_act_fn\": \"gelu_new\",\n", " \"dropout_rate\": 0.1,\n", " \"eos_token_id\": 1,\n", " \"feed_forward_proj\": \"gated-gelu\",\n", " \"initializer_factor\": 1.0,\n", " \"is_encoder_decoder\": true,\n", " \"is_gated_act\": true,\n", " \"layer_norm_epsilon\": 1e-06,\n", " \"model_type\": \"t5\",\n", " \"n_positions\": 512,\n", " \"num_decoder_layers\": 24,\n", " \"num_heads\": 16,\n", " \"num_layers\": 24,\n", " \"output_past\": true,\n", " \"pad_token_id\": 0,\n", " \"relative_attention_max_distance\": 128,\n", " \"relative_attention_num_buckets\": 32,\n", " \"tie_word_embeddings\": false,\n", " \"transformers_version\": \"4.46.1\",\n", " \"use_cache\": true,\n", " \"vocab_size\": 32128\n", "}\n", "\n", "Config of the audio_encoder: is overwritten by shared audio_encoder config: DACConfig {\n", " \"_name_or_path\": \"parler-tts/dac_44khZ_8kbps\",\n", " \"architectures\": [\n", " \"DACModel\"\n", " ],\n", " \"codebook_size\": 1024,\n", " \"frame_rate\": 86,\n", " \"latent_dim\": 1024,\n", " \"model_bitrate\": 8,\n", " \"model_type\": \"dac_on_the_hub\",\n", " \"num_codebooks\": 9,\n", " \"sampling_rate\": 44100,\n", " \"torch_dtype\": \"float32\",\n", " \"transformers_version\": \"4.46.1\"\n", "}\n", "\n", "Config of the decoder: is overwritten by shared decoder config: ParlerTTSDecoderConfig {\n", " \"_name_or_path\": \"/fsx/yoach/tmp/artefacts/parler-tts-mini/decoder\",\n", " \"activation_dropout\": 0.0,\n", " \"activation_function\": \"gelu\",\n", " \"add_cross_attention\": true,\n", " \"architectures\": [\n", " \"ParlerTTSForCausalLM\"\n", " ],\n", " \"attention_dropout\": 0.0,\n", " \"bos_token_id\": 1025,\n", " \"codebook_weights\": null,\n", " \"cross_attention_implementation_strategy\": null,\n", " \"dropout\": 0.1,\n", " \"eos_token_id\": 1024,\n", " \"ffn_dim\": 4096,\n", " \"hidden_size\": 1024,\n", " \"initializer_factor\": 0.02,\n", " \"is_decoder\": true,\n", " \"layerdrop\": 0.0,\n", " \"max_position_embeddings\": 4096,\n", " \"model_type\": \"parler_tts_decoder\",\n", " \"num_attention_heads\": 16,\n", " \"num_codebooks\": 9,\n", " \"num_cross_attention_key_value_heads\": 16,\n", " \"num_hidden_layers\": 24,\n", " \"num_key_value_heads\": 16,\n", " \"pad_token_id\": 1024,\n", " \"rope_embeddings\": false,\n", " \"rope_theta\": 10000.0,\n", " \"scale_embedding\": false,\n", " \"tie_word_embeddings\": false,\n", " \"torch_dtype\": \"float32\",\n", " \"transformers_version\": \"4.46.1\",\n", " \"use_cache\": true,\n", " \"use_fused_lm_heads\": false,\n", " \"vocab_size\": 1088\n", "}\n", "\n" ] }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "eba9501465d34973a33c6435e60e9040", 
"version_major": 2, "version_minor": 0 }, "text/plain": [ "generation_config.json: 0%| | 0.00/265 [00:00\n", " \n", " Your browser does not support the audio element.\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Set up device\n", "device = \"cuda\" if torch.cuda.is_available() else \"cpu\"\n", "\n", "# Load model and tokenizer\n", "model = ParlerTTSForConditionalGeneration.from_pretrained(\"parler-tts/parler-tts-mini-v1\").to(device)\n", "tokenizer = AutoTokenizer.from_pretrained(\"parler-tts/parler-tts-mini-v1\")\n", "\n", "# Define text and description\n", "text_prompt = \"\"\"\n", "Exactly! And the distillation part is where you take a LARGE-model,and compress-it down into a smaller, more efficient model that can run on devices with limited resources.\n", "\"\"\"\n", "description = \"\"\"\n", "Laura's voice is expressive and dramatic in delivery, speaking at a fast pace with a very close recording that almost has no background noise.\n", "\"\"\"\n", "# Tokenize inputs\n", "input_ids = tokenizer(description, return_tensors=\"pt\").input_ids.to(device)\n", "prompt_input_ids = tokenizer(text_prompt, return_tensors=\"pt\").input_ids.to(device)\n", "\n", "# Generate audio\n", "generation = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)\n", "audio_arr = generation.cpu().numpy().squeeze()\n", "\n", "# Play audio in notebook\n", "ipd.Audio(audio_arr, rate=model.config.sampling_rate)" ] }, { "cell_type": "markdown", "id": "03c2abc6-4a1d-4318-af6f-0257dd66a691", "metadata": {}, "source": [ "#### Bark Model\n", "\n", "Amazing, let's try the same with bark now:\n", "- We will set the `voice_preset` to our favorite speaker\n", "- This time we can include expression prompts inside our generation prompt\n", "- Note you can CAPTILISE words to make the model emphasise on these\n", "- You can add hyphens to make the model pause on certain words" ] }, { "cell_type": "code", "execution_count": 5, "id": "a20730f0-13dd-48b4-80b6-7c6ef05a0cc4", "metadata": {}, "outputs": [], "source": [ "voice_preset = \"v2/en_speaker_6\"\n", "sampling_rate = 24000" ] }, { "cell_type": "code", "execution_count": 6, "id": "246d0cbc-c5d8-4f34-b8e4-dd18a624cdad", "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "0d72afdeaff141fd9acc22aaa5d63e20", "version_major": 2, "version_minor": 0 }, "text/plain": [ "tokenizer_config.json: 0%| | 0.00/353 [00:00\n", " \n", " Your browser does not support the audio element.\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "text_prompt = \"\"\"\n", "Exactly! 
, { "cell_type": "code", "execution_count": 7, "id": "f7d2e9a1-3c44-4b8a-9e2f-0c1d2e3f4a5b", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "<IPython.lib.display.Audio object>" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "text_prompt = \"\"\"\n", "Exactly! [sigh] And the distillation part is where you take a LARGE-model, and compress-it down into a smaller, more efficient model that can run on devices with limited resources.\n", "\"\"\"\n", "inputs = processor(text_prompt, voice_preset=voice_preset).to(device)\n", "\n", "speech_output = model.generate(**inputs, temperature = 0.9, semantic_temperature = 0.8)\n", "Audio(speech_output[0].cpu().numpy(), rate=sampling_rate)" ] }, { "cell_type": "markdown", "id": "dd650176-ab17-47a7-8e02-10dc9ca9e852", "metadata": {}, "source": [ "## Bringing it together: Making the Podcast\n", "\n", "Okay, now that we understand everything, we can use the complete pipeline to generate the entire podcast.\n", "\n", "Let's load in our pickle file from earlier and proceed:" ] }, { "cell_type": "code", "execution_count": 8, "id": "b1dca30f-1226-4002-8e02-fd97e78ecc83", "metadata": {}, "outputs": [], "source": [ "import pickle\n", "\n", "with open('./resources/podcast_ready_data.pkl', 'rb') as file:\n", "    PODCAST_TEXT = pickle.load(file)" ] }, { "cell_type": "markdown", "id": "c10a3d50-08a7-4786-8e28-8fb6b8b048ab", "metadata": {}, "source": [ "Let's load the Bark model and set its hyperparameters for our podcast dialogue" ] }, { "cell_type": "code", "execution_count": 9, "id": "8db78921-36c7-4388-b1d9-78dff4f972c2", "metadata": {}, "outputs": [], "source": [ "bark_processor = AutoProcessor.from_pretrained(\"suno/bark\")\n", "bark_model = BarkModel.from_pretrained(\"suno/bark\", torch_dtype=torch.float16).to(\"cuda\")\n", "bark_sampling_rate = 24000" ] }
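, { "cell_type": "markdown", "id": "2c5d6e7f-8a9b-4c0d-9e1f-3a4b5c6d7e8f", "metadata": {}, "source": [ "Optionally, Bark exposes a couple of speed/memory knobs that pair with the speedup packages from the install cell at the top. A hedged sketch, with both options commented out on purpose: `enable_cpu_offload()` assumes the `accelerate` package is installed, and the flash-attention variant assumes `flash-attn` plus a supported GPU:" ] }, { "cell_type": "code", "execution_count": null, "id": "3d6e7f8a-9b0c-4d1e-8f2a-4b5c6d7e8f9a", "metadata": {}, "outputs": [], "source": [ "# Optional speedups -- not required for the rest of the notebook.\n", "\n", "# Offload Bark's sub-models to CPU and move each to GPU only while it is in use:\n", "# bark_model.enable_cpu_offload()\n", "\n", "# Or load Bark with flash attention 2 (if flash-attn is installed):\n", "# bark_model = BarkModel.from_pretrained(\n", "#     \"suno/bark\", torch_dtype=torch.float16, attn_implementation=\"flash_attention_2\"\n", "# ).to(\"cuda\")" ] }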
\"_name_or_path\": \"/fsx/yoach/tmp/artefacts/parler-tts-mini/decoder\",\n", " \"activation_dropout\": 0.0,\n", " \"activation_function\": \"gelu\",\n", " \"add_cross_attention\": true,\n", " \"architectures\": [\n", " \"ParlerTTSForCausalLM\"\n", " ],\n", " \"attention_dropout\": 0.0,\n", " \"bos_token_id\": 1025,\n", " \"codebook_weights\": null,\n", " \"cross_attention_implementation_strategy\": null,\n", " \"dropout\": 0.1,\n", " \"eos_token_id\": 1024,\n", " \"ffn_dim\": 4096,\n", " \"hidden_size\": 1024,\n", " \"initializer_factor\": 0.02,\n", " \"is_decoder\": true,\n", " \"layerdrop\": 0.0,\n", " \"max_position_embeddings\": 4096,\n", " \"model_type\": \"parler_tts_decoder\",\n", " \"num_attention_heads\": 16,\n", " \"num_codebooks\": 9,\n", " \"num_cross_attention_key_value_heads\": 16,\n", " \"num_hidden_layers\": 24,\n", " \"num_key_value_heads\": 16,\n", " \"pad_token_id\": 1024,\n", " \"rope_embeddings\": false,\n", " \"rope_theta\": 10000.0,\n", " \"scale_embedding\": false,\n", " \"tie_word_embeddings\": false,\n", " \"torch_dtype\": \"float32\",\n", " \"transformers_version\": \"4.46.1\",\n", " \"use_cache\": true,\n", " \"use_fused_lm_heads\": false,\n", " \"vocab_size\": 1088\n", "}\n", "\n" ] } ], "source": [ "parler_model = ParlerTTSForConditionalGeneration.from_pretrained(\"parler-tts/parler-tts-mini-v1\").to(\"cuda\")\n", "parler_tokenizer = AutoTokenizer.from_pretrained(\"parler-tts/parler-tts-mini-v1\")" ] }, { "cell_type": "code", "execution_count": 11, "id": "efbe1434-37f3-4f77-a5fb-b39625f5e676", "metadata": {}, "outputs": [], "source": [ "speaker1_description = \"\"\"\n", "Laura's voice is expressive and dramatic in delivery, speaking at a moderately fast pace with a very close recording that almost has no background noise.\n", "\"\"\"\n", "speaker2_description = \"\"\"\n", "Toms's voice is smooth and suave in delivery, speaking at a slow, methodic pace with a very close recording that almost has no background noise.\n", "\"\"\"" ] }, { "cell_type": "markdown", "id": "56f6fa24-fe07-4702-850f-0428bfadd2dc", "metadata": {}, "source": [ "We will concatenate the generated segments of audio and also their respective sampling rates since we will require this to generate the final audio" ] }, { "cell_type": "code", "execution_count": 12, "id": "cebfd0f9-8703-4fce-b207-014c6e16cc8a", "metadata": {}, "outputs": [], "source": [ "generated_segments = []\n", "sampling_rates = [] # We'll need to keep track of sampling rates for each segment" ] }, { "cell_type": "code", "execution_count": 13, "id": "9b333e36-9579-4237-b329-e2911229be42", "metadata": {}, "outputs": [], "source": [ "device=\"cuda\"" ] }, { "cell_type": "markdown", "id": "d7b2490c-012f-4e35-8890-cd6a5eaf4cc4", "metadata": {}, "source": [ "Function generate text for speaker 1" ] }, { "cell_type": "code", "execution_count": 14, "id": "50323f9e-09ed-4c8c-9020-1511ab775969", "metadata": {}, "outputs": [], "source": [ "def generate_speaker1_audio(text):\n", " \"\"\"Generate audio using ParlerTTS for Speaker 1\"\"\"\n", " input_ids = parler_tokenizer(speaker1_description, return_tensors=\"pt\").input_ids.to(device)\n", " prompt_input_ids = parler_tokenizer(text, return_tensors=\"pt\").input_ids.to(device)\n", " generation = parler_model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)\n", " audio_arr = generation.cpu().numpy().squeeze()\n", " return audio_arr, parler_model.config.sampling_rate" ] }, { "cell_type": "markdown", "id": "3fb5dac8-30a6-4aa2-a983-b5f1df3d56af", "metadata": {}, "source": [ 
"Function to generate text for speaker 2" ] }, { "cell_type": "code", "execution_count": 15, "id": "0e6120ba-5190-4739-97ca-4e8b44dddc5e", "metadata": {}, "outputs": [], "source": [ "# def generate_speaker2_audio(text):\n", "# \"\"\"Generate audio using ParlerTTS for Speaker 2\"\"\"\n", "# input_ids = parler_tokenizer(speaker2_description, return_tensors=\"pt\").input_ids.to(device)\n", "# prompt_input_ids = parler_tokenizer(text, return_tensors=\"pt\").input_ids.to(device)\n", "# generation = parler_model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)\n", "# audio_arr = generation.cpu().numpy().squeeze()\n", "# return audio_arr, parler_model.config.sampling_rate\n", "def generate_speaker2_audio(text):\n", " \"\"\"Generate audio using Bark for Speaker 2\"\"\"\n", " inputs = bark_processor(text, voice_preset=\"v2/en_speaker_6\").to(device)\n", " speech_output = bark_model.generate(**inputs, temperature=0.9, semantic_temperature=0.8)\n", " audio_arr = speech_output[0].cpu().numpy()\n", " return audio_arr, bark_sampling_rate\n" ] }, { "cell_type": "markdown", "id": "7ea67fd1-9405-4fce-b08b-df5e11d0bf37", "metadata": {}, "source": [ "Helper function to convert the numpy output from the models into audio" ] }, { "cell_type": "code", "execution_count": 16, "id": "4482d864-2806-4410-b239-da4b2d0d1340", "metadata": {}, "outputs": [], "source": [ "import io\n", "from scipy.io import wavfile\n", "from pydub import AudioSegment\n", "\n", "def numpy_to_audio_segment(audio_arr, sampling_rate):\n", " \"\"\"Convert numpy array to AudioSegment\"\"\"\n", " # Convert to 16-bit PCM\n", " audio_int16 = (audio_arr * 32767).astype(np.int16)\n", " \n", " # Create WAV file in memory\n", " byte_io = io.BytesIO()\n", " \n", " wavfile.write(byte_io, sampling_rate, audio_int16)\n", " byte_io.seek(0)\n", " \n", " # Convert to AudioSegment\n", " return AudioSegment.from_wav(byte_io)" ] }, { "cell_type": "code", "execution_count": 19, "id": "c4dbb3b3-cdd3-4a1f-a60a-661e64a67f53", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'[\\n (\"Speaker 1\", \"Welcome to \\'The Think Tank\\'! I\\'m Rachel, and I\\'m thrilled to have my co-host, Alex, joining me on this thought-provoking journey. We\\'re diving into the fascinating world of artificial intelligence, and I\\'m excited to share with you the latest advancements and insights in this field. Alex, let\\'s start with the basics. What\\'s your take on AI?\"),\\n (\"Speaker 2\", \"I think AI is like the ultimate game-changer. We\\'re already using chatbots to book our flights and hotels. But with the rise of natural language processing, AI is becoming increasingly sophisticated. It\\'s like having a super-smart personal assistant that can learn and adapt to our needs. Umm, like Alexa or Siri?\"),\\n (\"Speaker 1\", \"That\\'s right, Alex. And I love the analogy of a personal assistant. You know, I was talking to a friend who\\'s a tech entrepreneur, and he said that AI is like the ultimate enabler. It\\'s allowing companies to automate tasks that were previously done by humans, freeing up resources for more strategic and creative work. Hmm, it\\'s amazing to think about how much more efficient our daily lives could be with AI.\"),\\n (\"Speaker 2\", \"Hmmm, right... Yeah, that makes sense. I mean, we\\'ve seen companies like Netflix and Amazon using AI to personalize their services. It\\'s like having a super-smart algorithm that knows exactly what you want before you even ask for it. 
[laughs] I mean, have you ever ordered something on Amazon and gotten a personalized recommendation? That\\'s some crazy AI magic!\"),\\n (\"Speaker 1\", \"Exactly! And that\\'s what I love about AI. It\\'s not just about automating tasks, it\\'s about creating new experiences and opportunities. I was talking to a colleague who\\'s a data scientist, and he said that AI is going to revolutionize the way we approach healthcare. We\\'ll be able to analyze medical data in ways that were previously impossible, leading to breakthroughs in disease diagnosis and treatment. Umm, it\\'s a pretty exciting space.\"),\\n (\"Speaker 2\", \"Umm, excuse me, Rachel. I think you\\'re glossing over the potential risks of AI. I mean, we\\'ve seen movies like \\'The Terminator\\' and \\'Ex Machina\\' that showcase the darker side of AI. What about the job displacement concerns? We can\\'t just create machines that do everything better than humans without thinking about the consequences. Sigh, we have to consider the social implications.\"),\\n (\"Speaker 1\", \"That\\'s a great point, Alex. And I think that\\'s where the conversation gets really interesting. I mean, we\\'re not just talking about replacing humans with machines, we\\'re talking about augmenting human capabilities. And that\\'s where the potential for real-world impact comes in. I was talking to a startup founder who\\'s working on an AI-powered platform that\\'s helping small businesses manage their finances. It\\'s amazing to see how AI can be used to level the playing field for entrepreneurs. Hmm, I think that\\'s a really important message.\"),\\n (\"Speaker 2\", \"So, Rachel, can you give us an example of how AI is being used to solve real-world problems? Like, what\\'s the story behind that startup you mentioned? I mean, how does it actually work? Is it like machine learning or deep learning? [laughs] I\\'m so curious!\"),\\n (\"Speaker 1\", \"Well, actually, the story goes like this: Meet Maria, a single mom who owns a small bakery in a low-income neighborhood. She was struggling to manage her finances, juggling multiple clients and suppliers. But with the help of our startup, she was able to streamline her operations, optimize her pricing, and even start selling her products online. It\\'s amazing to see how AI-powered tools could help her scale her business and improve her bottom line. The technology is actually a hybrid approach that combines both machine learning and rule-based systems. It\\'s like a smart dashboard that provides real-time insights and recommendations to businesses like Maria\\'s. Umm, it\\'s pretty fascinating!\"),\\n (\"Speaker 2\", \"Wow, that\\'s incredible! And what about the potential for AI to help solve global problems like climate change? I mean, can AI really help us find sustainable solutions? We need to explore that topic!\"),\\n (\"Speaker 1\", \"That\\'s a great question, Alex. And I think AI can indeed play a crucial role in addressing global challenges. For example, AI can help optimize energy consumption, predict weather patterns, and even aid in disaster response. It\\'s amazing to see how AI can be used to drive positive change. Hmm, I think we\\'re just scratching the surface of what\\'s possible with AI.\"),\\n (\"Speaker 2\", \"I completely agree! And what about the ethics of AI development? I mean, who\\'s responsible for ensuring that AI is developed in a way that\\'s transparent and fair? We need to have those conversations!\"),\\n (\"Speaker 1\", \"That\\'s a fantastic point, Alex. 
And I think it\\'s essential to have ongoing discussions about the ethics and governance of AI development. We need to work together to ensure that AI is developed in a way that benefits society as a whole. Umm, it\\'s a complex issue, but I think we\\'re on the right track.\"),\\n]'" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "PODCAST_TEXT[:-1]" ] }, { "cell_type": "markdown", "id": "485b4c9e-379f-4004-bdd0-93a53f3f7ee0", "metadata": {}, "source": [ "Most of the time we argue that Data Structures aren't very useful in life. This time, however, the knowledge comes in handy.\n", "\n", "The pickle file stores the transcript as a single string, so we will parse it back into a list of `(speaker, text)` tuples with the help of `ast.literal_eval()`" ] }, { "cell_type": "code", "execution_count": 20, "id": "9946e46c-3457-4bf9-9042-b89fa8f5b47a", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('Speaker 1',\n", " \"Welcome to 'The Think Tank'! I'm Rachel, and I'm thrilled to have my co-host, Alex, joining me on this thought-provoking journey. We're diving into the fascinating world of artificial intelligence, and I'm excited to share with you the latest advancements and insights in this field. Alex, let's start with the basics. What's your take on AI?\"),\n", " ('Speaker 2',\n", " \"I think AI is like the ultimate game-changer. We're already using chatbots to book our flights and hotels. But with the rise of natural language processing, AI is becoming increasingly sophisticated. It's like having a super-smart personal assistant that can learn and adapt to our needs. Umm, like Alexa or Siri?\"),\n", " ('Speaker 1',\n", " \"That's right, Alex. And I love the analogy of a personal assistant. You know, I was talking to a friend who's a tech entrepreneur, and he said that AI is like the ultimate enabler. It's allowing companies to automate tasks that were previously done by humans, freeing up resources for more strategic and creative work. Hmm, it's amazing to think about how much more efficient our daily lives could be with AI.\"),\n", " ('Speaker 2',\n", " \"Hmmm, right... Yeah, that makes sense. I mean, we've seen companies like Netflix and Amazon using AI to personalize their services. It's like having a super-smart algorithm that knows exactly what you want before you even ask for it. [laughs] I mean, have you ever ordered something on Amazon and gotten a personalized recommendation? That's some crazy AI magic!\"),\n", " ('Speaker 1',\n", " \"Exactly! And that's what I love about AI. It's not just about automating tasks, it's about creating new experiences and opportunities. I was talking to a colleague who's a data scientist, and he said that AI is going to revolutionize the way we approach healthcare. We'll be able to analyze medical data in ways that were previously impossible, leading to breakthroughs in disease diagnosis and treatment. Umm, it's a pretty exciting space.\"),\n", " ('Speaker 2',\n", " \"Umm, excuse me, Rachel. I think you're glossing over the potential risks of AI. I mean, we've seen movies like 'The Terminator' and 'Ex Machina' that showcase the darker side of AI. What about the job displacement concerns? We can't just create machines that do everything better than humans without thinking about the consequences. Sigh, we have to consider the social implications.\"),\n", " ('Speaker 1',\n", " \"That's a great point, Alex. And I think that's where the conversation gets really interesting. 
I mean, we're not just talking about replacing humans with machines, we're talking about augmenting human capabilities. And that's where the potential for real-world impact comes in. I was talking to a startup founder who's working on an AI-powered platform that's helping small businesses manage their finances. It's amazing to see how AI can be used to level the playing field for entrepreneurs. Hmm, I think that's a really important message.\"),\n", " ('Speaker 2',\n", " \"So, Rachel, can you give us an example of how AI is being used to solve real-world problems? Like, what's the story behind that startup you mentioned? I mean, how does it actually work? Is it like machine learning or deep learning? [laughs] I'm so curious!\"),\n", " ('Speaker 1',\n", " \"Well, actually, the story goes like this: Meet Maria, a single mom who owns a small bakery in a low-income neighborhood. She was struggling to manage her finances, juggling multiple clients and suppliers. But with the help of our startup, she was able to streamline her operations, optimize her pricing, and even start selling her products online. It's amazing to see how AI-powered tools could help her scale her business and improve her bottom line. The technology is actually a hybrid approach that combines both machine learning and rule-based systems. It's like a smart dashboard that provides real-time insights and recommendations to businesses like Maria's. Umm, it's pretty fascinating!\"),\n", " ('Speaker 2',\n", " \"Wow, that's incredible! And what about the potential for AI to help solve global problems like climate change? I mean, can AI really help us find sustainable solutions? We need to explore that topic!\"),\n", " ('Speaker 1',\n", " \"That's a great question, Alex. And I think AI can indeed play a crucial role in addressing global challenges. For example, AI can help optimize energy consumption, predict weather patterns, and even aid in disaster response. It's amazing to see how AI can be used to drive positive change. Hmm, I think we're just scratching the surface of what's possible with AI.\"),\n", " ('Speaker 2',\n", " \"I completely agree! And what about the ethics of AI development? I mean, who's responsible for ensuring that AI is developed in a way that's transparent and fair? We need to have those conversations!\"),\n", " ('Speaker 1',\n", " \"That's a fantastic point, Alex. And I think it's essential to have ongoing discussions about the ethics and governance of AI development. We need to work together to ensure that AI is developed in a way that benefits society as a whole. Umm, it's a complex issue, but I think we're on the right track.\")]" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import ast\n", "ast.literal_eval(PODCAST_TEXT[:-1])" ] }, { "cell_type": "markdown", "id": "5c7b4c11-5526-4b13-b0a2-8ca541c475aa", "metadata": {}, "source": [ "#### Generating the Final Podcast\n", "\n", "Finally, we can loop over the list of `(speaker, text)` tuples and use our helper functions to generate the audio" ] }
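, { "cell_type": "markdown", "id": "6a9b0c1d-2e3f-4a4b-9c5d-7e8f9a0b1c2d", "metadata": {}, "source": [ "Turns usually sound more natural with a brief pause between speakers. An optional sketch (the 300 ms value is an arbitrary assumption) that the loop below could use when appending segments:" ] }, { "cell_type": "code", "execution_count": null, "id": "7b0c1d2e-3f4a-4b5c-8d6e-8f9a0b1c2d3e", "metadata": {}, "outputs": [], "source": [ "from pydub import AudioSegment\n", "\n", "# Optional: a short silence between speaker turns (duration in milliseconds)\n", "TURN_GAP = AudioSegment.silent(duration=300)\n", "\n", "# In the loop below you could then write:\n", "#     final_audio += audio_segment + TURN_GAP" ] }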
, { "cell_type": "code", "execution_count": 21, "id": "c640fead-2017-478f-a7b6-1b96105d45d6", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Generating podcast segments: 8%|▊ | 1/13 [00:28<05:46, 28.83s/segment]The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.\n", "Setting `pad_token_id` to `eos_token_id`:None for open-end generation.\n", "Generating podcast segments: 23%|██▎ | 3/13 [01:41<05:37, 33.75s/segment]The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.\n", "Setting `pad_token_id` to `eos_token_id`:None for open-end generation.\n", "Generating podcast segments: 38%|███▊ | 5/13 [03:04<05:03, 37.98s/segment]The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.\n", "Setting `pad_token_id` to `eos_token_id`:None for open-end generation.\n", "Generating podcast segments: 54%|█████▍ | 7/13 [04:30<04:05, 40.86s/segment]The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.\n", "Setting `pad_token_id` to `eos_token_id`:None for open-end generation.\n", "Generating podcast segments: 69%|██████▉ | 9/13 [05:59<02:51, 42.78s/segment]The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.\n", "Setting `pad_token_id` to `eos_token_id`:None for open-end generation.\n", "Generating podcast segments: 85%|████████▍ | 11/13 [07:10<01:16, 38.48s/segment]The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.\n", "Setting `pad_token_id` to `eos_token_id`:None for open-end generation.\n", "Generating podcast segments: 100%|██████████| 13/13 [08:21<00:00, 38.57s/segment]\n" ] } ], "source": [ "final_audio = None\n", "\n", "for speaker, text in tqdm(ast.literal_eval(PODCAST_TEXT[:-1]), desc=\"Generating podcast segments\", unit=\"segment\"):\n", "    if speaker == \"Speaker 1\":\n", "        audio_arr, rate = generate_speaker1_audio(text)\n", "    else:  # Speaker 2\n", "        audio_arr, rate = generate_speaker2_audio(text)\n", "    \n", "    # Convert to AudioSegment (pydub will handle sample rate conversion automatically)\n", "    audio_segment = numpy_to_audio_segment(audio_arr, rate)\n", "    \n", "    # Add to final audio\n", "    if final_audio is None:\n", "        final_audio = audio_segment\n", "    else:\n", "        final_audio += audio_segment" ] }, { "cell_type": "markdown", "id": "4fbb2228-8023-44c4-aafe-d6e1d22ff8e4", "metadata": {}, "source": [ "### Output the Podcast\n", "\n", "We can now save this as an MP3 file" ] }, { "cell_type": "code", "execution_count": 22, "id": "2eeffdb7-875a-45ec-bdd8-c8c5b34f5a7b", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "<_io.BufferedRandom name='./resources/_podcast.mp3'>" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "final_audio.export(\"./resources/_podcast.mp3\", \n", "                   format=\"mp3\", \n", "                   bitrate=\"192k\",\n", "                   parameters=[\"-q:a\", \"0\"])" ] }, { "cell_type": "markdown", "id": "c7ce5836", "metadata": {}, "source": [ "### Suggested Next Steps:\n", "\n", "- Experiment with the prompts: Please feel free to experiment with the SYSTEM_PROMPT in the notebooks\n", "- Extend the workflow beyond two speakers (see the sketch below)\n", "- Test other TTS models\n", "- Experiment with Speech Enhancer models as a step 5."
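] }, { "cell_type": "markdown", "id": "8c1d2e3f-4a5b-4c6d-9e7f-9a0b1c2d3e4f", "metadata": {}, "source": [ "For instance, extending the workflow beyond two speakers mostly amounts to swapping the if/else in the generation loop for a speaker-to-voice map. A minimal sketch; `speaker3_description`, `generate_parler_audio`, and `SPEAKER_MAP` are hypothetical names introduced here for illustration:" ] }, { "cell_type": "code", "execution_count": null, "id": "9d2e3f4a-5b6c-4d7e-8f8a-0b1c2d3e4f5a", "metadata": {}, "outputs": [], "source": [ "# Hypothetical extension: route each speaker through a map instead of the if/else above.\n", "# Speaker 3's description is made up for illustration.\n", "speaker3_description = \"\"\"\n", "Alex's voice is warm and friendly, speaking at a measured pace with a very close recording that almost has no background noise.\n", "\"\"\"\n", "\n", "def generate_parler_audio(text, description):\n", "    \"\"\"Generate audio with ParlerTTS for an arbitrary speaker description\"\"\"\n", "    input_ids = parler_tokenizer(description, return_tensors=\"pt\").input_ids.to(device)\n", "    prompt_input_ids = parler_tokenizer(text, return_tensors=\"pt\").input_ids.to(device)\n", "    generation = parler_model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids)\n", "    return generation.cpu().numpy().squeeze(), parler_model.config.sampling_rate\n", "\n", "SPEAKER_MAP = {\n", "    \"Speaker 1\": lambda text: generate_parler_audio(text, speaker1_description),\n", "    \"Speaker 2\": generate_speaker2_audio,  # Bark, as defined earlier\n", "    \"Speaker 3\": lambda text: generate_parler_audio(text, speaker3_description),\n", "}\n", "\n", "# Inside the generation loop, the if/else then becomes:\n", "#     audio_arr, rate = SPEAKER_MAP[speaker](text)"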
] }, { "cell_type": "code", "execution_count": null, "id": "26cc56c5-b9c9-47c2-b860-0ea9f05c79af", "metadata": {}, "outputs": [], "source": [ "#fin" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.10" } }, "nbformat": 4, "nbformat_minor": 5 }