Enhancing My AI App: Project Update

Hi everyone,

I'm back with a big update to my second AI project, VidNote AI (the YouTube summarizer).

In my first post about this project, I shared my debugging story of fixing crashes and making the app robust. But after using it, I realized it still had one major flaw.

It was this simple-looking text box:

This was terrible for the user. How would anyone know the correct language code for a video ('en', 'hi', 'es')? They would have to be a detective just to use my app. It was bad design, and I was frustrated with it.

So, I fixed it.

Here's the New, Smarter App

The "Video Language" input is completely gone.

Now, you just paste in any YouTube URL, and the app automatically finds the best possible transcript, translates it if necessary, and generates the notes. No more guessing.

Here's a quick demo video of the new update:

https://youtu.be/A6VCTTjInjE

And here's a screenshot of the new, clean UI after it runs:

It's a much cleaner and smarter experience.

➡️ Try the Live App: https://ai-youtube-assistant-by-saiteja-puttoju.streamlit.app/

➡️ See the Full Code: https://github.com/saiteja-puttoju/ai-youtube-assistant.git

My Debugging Story: From "Dumb" to "Smart"

My mission was to make the app figure out the language on its own.

I. The Problem (What Went Wrong)

My old get_transcripts function (from the first version) was too simple. It required the user to provide a language code.

This failed in many ways:

If the user typed 'en' but the video only had an auto-generated (not manual) transcript, the API call might fail.
If the user wanted to summarize a video in another language (e.g., Hindi), they would have to know the code was 'hi'.
It put all the work on the user, which is a classic sign of bad design.

II. The Investigation (How I Fixed It)

I needed to stop asking the user and just find the best transcript myself.

I went back and read the youtube-transcript-api documentation more carefully and discovered two key things:

I can get a list of all available transcripts for a video using ytt_api.list(video_id=video_id).
This list tells me if a transcript is manual (uploaded by the creator, high quality) or generated (auto-created by YouTube, lower quality).

This was my "Aha!" moment. I could build a priority system.

My logic would be:

Always prefer a manual 'en' transcript.
If not, take any other manual transcript (and translate it).
If not, take a generated 'en' transcript.
As a last resort, take any other generated transcript (and translate it).

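This ordering prefers human-written transcripts over auto-generated ones, and English over other languages within each tier, so the app picks the highest-quality transcript the video actually offers.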
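
The four tiers above can also be expressed as a ranking function. The following is a minimal sketch, not the code the app uses: `TranscriptInfo`, `priority_rank`, and `pick_best` are hypothetical names, and `TranscriptInfo` is only a stand-in carrying the two fields the logic relies on (`language_code` and `is_generated`).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TranscriptInfo:
    """Hypothetical stand-in for one transcript entry (not the library's class)."""
    language_code: str
    is_generated: bool


def priority_rank(t: TranscriptInfo) -> int:
    """Lower rank means more preferred; mirrors the four tiers listed above."""
    if not t.is_generated:
        return 0 if t.language_code == "en" else 1  # manual 'en', then other manual
    return 2 if t.language_code == "en" else 3      # generated 'en', then other generated


def pick_best(transcripts: list[TranscriptInfo]) -> Optional[TranscriptInfo]:
    """Return the highest-priority transcript, or None if there are none."""
    # min() returns the first item among ties, so listing order breaks ties.
    return min(transcripts, key=priority_rank, default=None)


available = [
    TranscriptInfo("en", is_generated=True),
    TranscriptInfo("hi", is_generated=False),
    TranscriptInfo("es", is_generated=True),
]
best = pick_best(available)
print(best.language_code, best.is_generated)  # hi False
```

Here the manual Hindi transcript wins over the auto-generated English one, which is exactly the case where translation is needed afterwards.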

III. The Solution (The Code)

I broke the solution into two parts: a new "smart" function in the backend, and a cleaner UI in the frontend.

Part 1: The New Backend Function (supporting_functions.py)

I wrote a new function called get_best_transcript(video_id).

This function does all the heavy lifting. It gets the full list, sorts all available transcripts into manual_codes and generated_codes, and then builds the final_priority_list based on my logic above.

Here is the key part of the code from supporting_functions.py:

# [Inside supporting_functions.py]
from youtube_transcript_api import YouTubeTranscriptApi, TranscriptsDisabled

def get_best_transcript(video_id: str) -> tuple[list, str] | tuple[None, str]:
    """
    Fetches the best available transcript (any language).
    Prioritizes:
    1. Manual 'en'
    2. Other Manual
    3. Generated 'en'
    4. Other Generated
    """

    manual_codes = []
    generated_codes = []

    try:
        # 1. Create an instance and get the list object
        ytt_api = YouTubeTranscriptApi()
        transcript_list = ytt_api.list(video_id=video_id)

        # 2. Sort all available codes into two lists
        for transcript in transcript_list:
            if transcript.is_generated:
                generated_codes.append(transcript.language_code)
            else:
                manual_codes.append(transcript.language_code)

    except TranscriptsDisabled:
        return None, "Transcripts are disabled for this video."
    # ... (other error handling) ...

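    # --- 3. Build the Final Prioritized List ---
    # find_transcript() tries these codes in order and, for each code,
    # prefers a manual transcript over a generated one, so a flat list
    # of language codes is enough to express the priority.
    final_priority_list = []

    if 'en' in manual_codes:
        final_priority_list.append('en')
        manual_codes.remove('en')

    final_priority_list.extend(manual_codes) # Add all other manual transcripts

    # Generated 'en' comes next, unless 'en' is already covered by a manual one
    if 'en' in generated_codes and 'en' not in final_priority_list:
        final_priority_list.append('en')

    # Add the remaining generated codes, skipping any already in the list
    for code in generated_codes:
        if code not in final_priority_list:
            final_priority_list.append(code)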

    # --- 4. Fetch the transcript using the list ---
    if final_priority_list:
        try:
            # Find the best transcript object from our priority list
            transcript_object = transcript_list.find_transcript(final_priority_list)

            # Fetch the actual data
            transcript_data = transcript_object.fetch()

            # Return the data and its language
            return transcript_data, transcript_object.language_code

        except Exception as e:
            return None, f"Could not fetch transcript data: {e}"
    else:
        # No transcripts were found at all
        return None, "No transcripts found for this video."

This function now returns both the transcript data and the language code it found (e.g., 'es', 'hi', 'en').
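
It may help to see why a flat list of language codes is enough to express the priority. The sketch below imitates the lookup order with plain sets. It assumes the behavior the library's README describes for find_transcript: codes are tried in order, and for a given code a manual transcript is chosen over a generated one. `find_first` is a hypothetical stand-in, not the library's function.

```python
from typing import Optional


def find_first(
    codes: list[str], manual: set[str], generated: set[str]
) -> Optional[tuple[str, str]]:
    """Hypothetical stand-in for the lookup: try codes in order, manual first."""
    for code in codes:
        if code in manual:
            return code, "manual"
        if code in generated:
            return code, "generated"
    return None  # none of the requested codes is available


# A video with a manual Hindi transcript plus generated English and Hindi ones.
manual = {"hi"}
generated = {"en", "hi"}

# The order step 3 produces for this video (duplicates omitted):
# manual codes first, then generated 'en'.
priority = ["hi", "en"]

print(find_first(priority, manual, generated))  # ('hi', 'manual')
```

Because the manual codes sit at the front of the list, the manual Hindi transcript is returned before the generated English one is ever considered.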

Part 2: The New Frontend Logic (app.py)

With the "smart" backend function, my app.py file became much simpler.

I deleted the st.text_input for language.
I call my new function: transcript_data, lang_code_or_error = get_best_transcript(video_id).
I check the lang_code it gives me back. If it's not 'en', I automatically send the transcript to my AI translator function.

The user does nothing but click "Submit."

Here's the new logic in app.py:

# [Inside app.py]
if submit_button:
    if not youtube_url:
        st.warning("⚠ Please insert youtube url in sidebar!")
    else:
        video_id = extract_video_id(youtube_url)

        if video_id:
            full_transcript = None # Initialize variable
            lang_code = None     # Initialize variable

            with st.spinner("Step 1/3 : Fetching Video Transcripts..."):
                # --- THIS IS THE NEW LOGIC ---
                transcript_data, lang_code_or_error = get_best_transcript(video_id)

                if not transcript_data:
                    # If it failed, show the error and stop
                    st.error(f"Failed to get transcript: {lang_code_or_error}")
                else:
                    # If it succeeded, set our variables
                    lang_code = lang_code_or_error
                    full_transcript = " ".join([line.text for line in transcript_data])

            # This 'if' check ensures the rest only runs on success
            if full_transcript:

                # --- NEW AUTOMATIC TRANSLATION STEP ---
                if lang_code != 'en':
                    with st.spinner("Step 1.5/3 : Translating transcripts into English..."):
                        full_transcript = translate_text(full_transcript)

                # The rest of the code runs perfectly!
                with st.spinner("Step 2/3 : Fetching key topics..."):
                    # ...

                with st.spinner("Step 3/3 : Generating Notes..."):
                    # ...

                st.success("✅ Generated notes successfully!")
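
The join-then-translate step in the flow above can also be pulled out into a small pure function, which makes it easy to test without Streamlit or a network call. This is a sketch under assumptions: `build_english_text` and `fake_translate` are hypothetical names, and the translator is passed in as a parameter so that a stub can replace the app's real translate_text.

```python
from typing import Callable, Iterable


def build_english_text(
    snippet_texts: Iterable[str],
    lang_code: str,
    translate: Callable[[str], str],
) -> str:
    """Join transcript snippets and translate only if they are not English."""
    full_text = " ".join(snippet_texts)
    if lang_code != "en":
        full_text = translate(full_text)
    return full_text


def fake_translate(text: str) -> str:
    """Stub translator for demonstration; the real app calls its AI translator."""
    return f"[translated] {text}"


print(build_english_text(["hola", "mundo"], "es", fake_translate))   # [translated] hola mundo
print(build_english_text(["hello", "world"], "en", fake_translate))  # hello world
```

In app.py this would be called with the `.text` of each transcript line and the language code returned by get_best_transcript.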

What I Learned

This was more than a feature update; it was a lesson in product design.

User Experience (UX) is a Feature: Removing a "feature" (the language input) made the app far easier to use. A good app shouldn't make the user think.
Read the Docs (Again): The solution was right there in the youtube-transcript-api documentation (.list() and .find_transcript()). My first version was lazy because I didn't read deeply enough.
Encapsulate Logic: By putting all the "priority" logic into one function (get_best_transcript), my main app.py file stays clean and readable.

What's Next? (The Future Update)

This new automatic transcript-fetching logic is the foundation for the real goal.

You've probably noticed the "Chat with Video (v2)" page in the app. My next step is to use these high-quality transcripts to build a full RAG (Retrieval-Augmented Generation) system. Soon, you'll be able to ask questions and "chat" directly with the YouTube video.

Thank you for reading my update!

All My Links

➡️ Read Part 1 (The Bug Fixes): https://saitejaputtoju.hashnode.dev/ai-youtube-assistant
➡️ Try the Live App: https://ai-youtube-assistant-by-saiteja-puttoju.streamlit.app/
➡️ See the Full Code on GitHub: https://github.com/saiteja-puttoju/ai-youtube-assistant
➡️ Watch the New Demo Video: https://youtu.be/A6VCTTjInjE
➡️ Connect with me on LinkedIn: https://www.linkedin.com/in/saiteja-puttoju/

What's a "small" UI/UX change you've made that had a "big" impact on one of your projects? Let me know in the comments!