ALICS
(Artificial Intelligence for Life Improvement and Counseling Support)
ALICS Documentation
Magic Mirror
At the very first stage of the project I planned to build everything from modules on the open-source MagicMirror platform, but because Node.js packages are constantly updating, many of those modules are poorly maintained.
During this process, however, I got to know the ElevenLabs API through the MMM-11-TTS module made by someone else, and it inspired me to go ahead with the hyper-realistic voice-over.
These are the links to the modules I used.
Python script (YouTube video)
I came across a YouTube video of someone building a Bing AI voice assistant. In his script he used Amazon Polly for the voice-over and had no user interface, since he ran the program purely in the terminal. So I went ahead and, more or less, stole his code.
Code in video
Here is the code from the video; I referenced its wake-word function and transcribe function.
import openai
import asyncio
import re
import whisper
import boto3
import pydub
from pydub import playback
import speech_recognition as sr
from EdgeGPT import Chatbot, ConversationStyle
# Initialize the OpenAI API
openai.api_key = "[paste your OpenAI API key here]"
# Create a recognizer object and wake word variables
recognizer = sr.Recognizer()
BING_WAKE_WORD = "bing"
GPT_WAKE_WORD = "gpt"
def get_wake_word(phrase):
if BING_WAKE_WORD in phrase.lower():
return BING_WAKE_WORD
elif GPT_WAKE_WORD in phrase.lower():
return GPT_WAKE_WORD
else:
return None
def synthesize_speech(text, output_filename):
polly = boto3.client('polly', region_name='us-west-2')
response = polly.synthesize_speech(
Text=text,
OutputFormat='mp3',
VoiceId='Salli',
Engine='neural'
)
with open(output_filename, 'wb') as f:
f.write(response['AudioStream'].read())
def play_audio(file):
sound = pydub.AudioSegment.from_file(file, format="mp3")
playback.play(sound)
async def main():
while True:
with sr.Microphone() as source:
recognizer.adjust_for_ambient_noise(source)
print(f"Waiting for wake words 'ok bing' or 'ok chat'...")
while True:
audio = recognizer.listen(source)
try:
with open("audio.wav", "wb") as f:
f.write(audio.get_wav_data())
# Load the Whisper tiny model and transcribe the captured audio
model = whisper.load_model("tiny")
result = model.transcribe("audio.wav")
phrase = result["text"]
print(f"You said: {phrase}")
wake_word = get_wake_word(phrase)
if wake_word is not None:
break
else:
print("Not a wake word. Try again.")
except Exception as e:
print("Error transcribing audio: {0}".format(e))
continue
print("Speak a prompt...")
synthesize_speech('What can I help you with?', 'response.mp3')
play_audio('response.mp3')
audio = recognizer.listen(source)
try:
with open("audio_prompt.wav", "wb") as f:
f.write(audio.get_wav_data())
model = whisper.load_model("base")
result = model.transcribe("audio_prompt.wav")
user_input = result["text"]
print(f"You said: {user_input}")
except Exception as e:
print("Error transcribing audio: {0}".format(e))
continue
if wake_word == BING_WAKE_WORD:
bot = Chatbot(cookie_path='cookies.json')
response = await bot.ask(prompt=user_input, conversation_style=ConversationStyle.precise)
# Select only the bot response from the response dictionary
for message in response["item"]["messages"]:
if message["author"] == "bot":
bot_response = message["text"]
# Remove [^#^] citations in response
bot_response = re.sub(r'\[\^\d+\^\]', '', bot_response)
else:
# Send prompt to GPT-3.5-turbo API
response = openai.ChatCompletion.create(
model="gpt-3.5-turbo",
messages=[
{"role": "system", "content":
"You are a helpful assistant."},
{"role": "user", "content": user_input},
],
temperature=0.5,
max_tokens=150,
top_p=1,
frequency_penalty=0,
presence_penalty=0,
n=1,
stop=["\nUser:"],
)
bot_response = response["choices"][0]["message"]["content"]
print("Bot's response:", bot_response)
synthesize_speech(bot_response, 'response.mp3')
play_audio('response.mp3')
await bot.close()
if __name__ == "__main__":
asyncio.run(main())
The actual build
I didn't comment the code as I was writing the script, and since the script ended up around 500 lines long, it would be too much to go back and comment every line. So I will just type out an explanation for each snippet.
Speech synthesis and playback
Replacing Amazon Polly
However, my intention was to substitute Amazon Polly with the ElevenLabs API, and that's where the problems came in. In the video he simply used the pydub sound library together with boto3 for the voice interaction, because Amazon Polly could hand pydub a WAV file to play back in time. ElevenLabs, however, could only deliver MP3 files, which made the whole process even slower.
Audio playback speed
I asked GPT what to do about the slow playback, and it suggested switching to another library, simpleaudio, converting the MP3 to WAV, and then playing it back to the user.
After I switched to simpleaudio there was a small clicking noise at the start of each file, so I skipped the first 44 bytes of the exported audio before playback; those 44 bytes are the standard WAV header, so dropping them leaves only the raw PCM samples and removes the click.
The speech_recognition library is used to capture and transcribe the user's voice input. A recognizer object is created to recognize speech from the microphone, and its energy_threshold is set to 300 to control how loud the input has to be before the recognizer treats it as speech rather than background noise.
The simpleaudio and pydub libraries are used to play audio files. The play_audio function loads an MP3 file, converts it to WAV format, and plays it. The function also sets the is_audio_playing variable to True so the audio status can be checked for video-update purposes.
The elevenlabs library is used to synthesize speech through the ElevenLabs API. The synthesize_speech function takes the text to be synthesized and an output filename, generates the audio using the generate function, and writes it to disk.
The transcribe_audio_with_whisper function uses the whisper library to transcribe an audio file with a Whisper model. The function loads a model of the chosen size and returns the transcribed text.
The process_user_input function takes the user's input, prefixes it with the string "You said: ", and puts the resulting string in the response_queue to be displayed later in the GUI.
Here is that part of the code
import speech_recognition as sr
import simpleaudio as sa
from pydub import AudioSegment
import warnings
from numba import NumbaDeprecationWarning
import queue
from elevenlabs import generate, play
warnings.filterwarnings("ignore", category=NumbaDeprecationWarning)
openai.api_key = "[YOUR API KEY]"
API_KEY = "[YOUR API KEY]"  # ElevenLabs API key
VOICE_ID = "[YOUR VOICE ID]"  # voice ID of your choice
recognizer = sr.Recognizer()
recognizer.energy_threshold = 300  # minimum audio energy before the recognizer treats input as speech
GPT_WAKE_WORD = "hi alex"
GPT_SLEEP_WORD = "goodbye"
def synthesize_speech(text, output_filename):
audio = generate(text, voice=VOICE_ID, api_key=API_KEY)
with open(output_filename, 'wb') as f:
f.write(audio)
def transcribe_audio_with_whisper(audio_file_path):
model = whisper.load_model("base") #choose the size of your whisper modle
result = model.transcribe(audio_file_path)
return result["text"].strip()
def play_audio(file):
global is_audio_playing
is_audio_playing = True  # track audio status for video-update purposes
sound = AudioSegment.from_mp3(file)  # expecting an mp3 file
audio_data = sound.export(format="wav")  # convert to wav in memory
audio_data = audio_data.read()
audio_data = audio_data[44:]  # skip the 44-byte WAV header to avoid a click at the start of playback
audio_wave = sa.WaveObject(audio_data, sound.channels, sound.sample_width, sound.frame_rate)
play_obj = audio_wave.play()
play_obj.wait_done()
is_audio_playing = False
def process_user_input(user_input, response_queue):
bot_response = "You said: " + user_input
response_queue.put(bot_response)
Building the Sleep Command
Because the video tutorial already had a small wake-word function, I decided to build a sleep command as well: whenever the sleep word is detected, the bot goes back to listening for the wake word. In the ideal setup, the user could simply leave the mirror hanging on the wall.
The assistant listens for a wake word, which is set to the value of GPT_WAKE_WORD. When it detects the wake word, it starts listening for user input and transcribes it to text using the speech recognition library.
The user input is then passed to the GPT-3.5 API using OpenAI's openai.ChatCompletion.create() method to generate a response. The response is synthesized into an audio file by the text-to-speech library and played back to the user.
The assistant also listens for a sleep word, which is set to the value of GPT_SLEEP_WORD, to stop listening for user input and go back to listening for the wake word. During this transition, the assistant plays an audio file indicating that it is going to sleep.
The main function main_with_gui() uses a GUI to display the assistant's responses and settings. response_queue is a queue that holds the bot's responses, and text_var is a tkinter variable used to update the GUI display. The settings parameter is a dictionary containing the settings for the GPT API, such as the maximum number of tokens and the system message sent to the model.
Here is the code
def get_wake_word(phrase):
if GPT_WAKE_WORD in phrase.lower():
return GPT_WAKE_WORD
else:
return None
def get_sleep_word(phrase):
if GPT_SLEEP_WORD in phrase.lower():
return GPT_SLEEP_WORD
else:
return None
async def main_with_gui(response_queue, text_var, settings):
wake_word_detected = False
greeting_played = False
while True:
with sr.Microphone() as source:
recognizer.adjust_for_ambient_noise(source)
if not wake_word_detected:
print(f"Say {GPT_WAKE_WORD} to start a conversation...")
while True:
audio = recognizer.listen(source)
audio_file = "audio.wav"
with open(audio_file, "wb") as f:
f.write(audio.get_wav_data())
phrase = transcribe_audio_with_whisper(audio_file)
print(f"Phrase: {phrase}")
if get_wake_word(phrase) is not None:
wake_word_detected = True
break
else:
print("Not a wake word. Try again.")
if not greeting_played:
play_audio('greetings.mp3')
greeting_played = True
while wake_word_detected:
print("Speak a prompt...")
audio = recognizer.listen(source)
audio_file = "audio_prompt.wav"
with open(audio_file, "wb") as f:
f.write(audio.get_wav_data())
user_input = transcribe_audio_with_whisper(audio_file)
print(f"User input: {user_input}")
if get_sleep_word(user_input) is not None:
wake_word_detected = False
greeting_played = False
print("Sleep word detected, going back to listening for wake word.")
play_audio('sleep.mp3')
text_var.set("")
break
response = openai.ChatCompletion.create(
model="gpt-3.5-turbo",
messages=[
{"role": "system", "content": settings['system_message']},
{"role": "user", "content": user_input},
],
temperature=0.6,
max_tokens=settings['max_tokens'],
top_p=1,
frequency_penalty=0,
presence_penalty=0,
n=1,
stop=["\nUser:"],
)
bot_response = response["choices"][0]["message"]["content"]
response_queue.put(bot_response)
synthesize_speech(bot_response, 'response.mp3')
play_audio('response.mp3')
Building the User Interface
I was not sure what to use for the user interface, so I just googled around and went with tkinter.
In the process, I swapped out the Text widget and used a Label for text display, as it looked more modern.
The create_gui() function creates the main window of the interface and sets its title, menu bar, and dimensions. It also initializes a dictionary called settings with default values that control the behavior of the program, such as the maximum number of tokens to use when generating responses and the system message that defines the assistant's persona.
The screen_width and screen_height variables are used to size the window so that it fills the entire screen.
The menu_bar object is created using the Menu widget, and a settings_menu is created as a dropdown on the menu bar. The settings_menu contains one option, "Open Settings", which opens a settings window when clicked.
The video_frame is created to hold the video stream, and the text_frame is created to display the program's responses to the user.
The update_video_label() function is responsible for updating the video stream shown in the GUI. It reads frames from one of two video files (one showing an idle animation, the other showing the AI talking), resizes them, and displays them in the GUI using Tkinter's Label widget.
The text_var and text_widget variables are used to display the program's responses to the user in the text_frame created earlier. The text_widget is created using the Label widget from Tkinter.
Finally, a response_queue is created to hold the program's responses, and two threads are started: one to run the main program logic, and another to update the text_widget with the program's responses.
Here is the code
def create_gui():
root = Tk()
root.title("ALICS (Artificial Intelligence for Life Improvement and Counseling Support)")
menu_bar= tk.Menu(root)
root.config(menu=menu_bar)
settings_menu = tk.Menu(menu_bar, tearoff=0)
menu_bar.add_cascade(label="Settings", menu=settings_menu)
settings_menu.add_command(label="Open Settings", command=lambda: show_settings_window(settings))
settings = {
'system_message': "Your name is Alice, an acronym for Artistic Intelligence for Life Improvement and Counseling Support. As a creative and insightful personal therapist, your mission is to help clients address their problems with touch of humor. Make an effort to connect with clients on a personal level by sharing relevant anecdotes or insightful metaphors when appropriate.",
'max_tokens': 150,
'settings_open': False
}
screen_width = root.winfo_screenwidth()
screen_height = root.winfo_screenheight()
root.geometry(f"{screen_width}x{screen_height}")
root.configure(bg="black")
system_message_var = StringVar()
system_message_var.set(settings['system_message'])
max_tokens_var = IntVar()
max_tokens_var.set(150)
video_frame = Frame(root, bg="black")
video_frame.pack(side="top", pady=(10, 0), anchor="center", expand=True)
video_file1 = "/Users/giaogiaoguo/Desktop/interface/Alics_vid/Defult2.MOV"
video_file2 = "/Users/giaogiaoguo/Desktop/interface/Alics_vid/SPeak.MOV"
video_capture1 = cv2.VideoCapture(video_file1)
video_capture2 = cv2.VideoCapture(video_file2)
second_video_frame_rate = 60
video_capture2.set(cv2.CAP_PROP_FPS, second_video_frame_rate)
def update_video_label():
global is_audio_playing, video_capture1, video_capture2
if is_audio_playing:
video_capture = video_capture2
scale_percent = 30
frame_rate=100
else:
video_capture = video_capture1
scale_percent = 30
frame_rate=100
ret, frame = video_capture.read()
if not ret:
video_capture.set(cv2.CAP_PROP_POS_FRAMES, 0)
ret, frame = video_capture.read()
cv2image = cv2.cvtColor(frame, cv2.COLOR_BGR2RGBA)
width = int(cv2image.shape[1] * scale_percent / 100)
height = int(cv2image.shape[0] * scale_percent / 100)
dim = (width, height)
resized = cv2.resize(cv2image, dim, interpolation=cv2.INTER_AREA)
img = Image.fromarray(resized)
imgtk = ImageTk.PhotoImage(image=img)
label.config(image=imgtk)
label.imgtk = imgtk
delay= int(1000 / frame_rate)
root.after(delay, update_video_label)
label = Label(video_frame, bg="black")  # using a Label instead of a Text widget
label.pack(side="top", anchor="center")
update_video_label()
text_frame = Frame(root, bg="black")
text_frame.pack(side="top", pady=(50, 20), anchor="center", expand=True)
text_var = StringVar()
text_widget = tk.Label(text_frame, textvariable=text_var, wraplength=1000, bg="black", fg="white", font=("Nanum Gothic", 12))
text_widget.pack(expand=True, fill=BOTH, anchor="center")
response_queue = queue.Queue()
threading.Thread(target=lambda: asyncio.run(main_with_gui(response_queue, text_var, settings)), daemon=True).start()
threading.Thread(target=update_text_widget, args=(text_widget, text_var, response_queue, root), daemon=True).start()
root.mainloop()
def update_text_widget(text_widget, text_var, response_queue, root):
while True:
response = response_queue.get()
text_var.set(f"ALICS: {response}\n")
Animation, Logo, and Clone Voice
This part of the documentation can be found here.

Here is the official ElevenLabs documentation for the Python library; the use of the clone function is described there.
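For reference, here is a minimal sketch of how the clone function from the elevenlabs library can be used (the voice name and sample file names below are placeholders; this mirrors the clone_voice and synthesize_speech functions in the final script further down):
from elevenlabs import clone, generate, set_api_key

set_api_key("[YOUR API KEY]")  # ElevenLabs API key

# clone() creates a new voice from one or more reference recordings
voice = clone(
    name="my-cloned-voice",                  # placeholder voice name
    description="a short description of the voice",
    files=["sample1.mp3", "sample2.mp3"],    # placeholder sample recordings
)

# generate() then synthesizes speech with the cloned voice and returns raw audio bytes
audio = generate(text="Hello, this is my cloned voice.", voice=voice)
with open("cloned_test.mp3", "wb") as f:
    f.write(audio)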
Settings Function
I have singled out the "max tokens" and "system message" settings so they can be adjusted from the GUI.
This code defines two functions related to the settings window of the application:
create_settings_window(settings): This function creates a new window using Tk() and sets its title to "ALICS Settings". It then defines a function called save_settings(), which updates the system_message and max_tokens values in the settings dictionary with the values entered in the respective Entry widgets; after updating the values, the window is destroyed and the settings_open flag is set back to False. The function then creates two Label widgets, one for the system message and one for max tokens, plus an Entry widget for each so the user can enter their own values. The initial values of the Entry widgets are taken from the current values stored in the settings dictionary. A Save button is also created, which calls save_settings() when clicked.
show_settings_window(settings): This function checks whether the settings window is already open by looking at the settings_open flag in the settings dictionary. If the window is not open, it sets the flag to True and calls create_settings_window() to create and display the settings window. If the window is already open, nothing further happens (the function only prints a short debug message every time it is called).
def create_settings_window(settings):
settings_window = Tk()
settings_window.title("ALICS Settings")
def save_settings():
settings['system_message'] = system_message_entry.get()
settings['max_tokens'] = int(max_tokens_entry.get())
settings_window.destroy()
settings['settings_open'] = False
system_message_label = tk.Label(settings_window, text="System message:")
system_message_label.grid(row=0, column=0, sticky="e", padx=5, pady=5)
system_message_entry = tk.Entry(settings_window)
system_message_entry.insert(0, settings['system_message'])
system_message_entry.grid(row=0, column=1, padx=5, pady=5)
max_tokens_label = tk.Label(settings_window, text="Max tokens:")
max_tokens_label.grid(row=1, column=0, sticky="e", padx=5, pady=5)
max_tokens_entry = tk.Entry(settings_window)
max_tokens_entry.insert(0, settings['max_tokens'])
max_tokens_entry.grid(row=1, column=1, padx=5, pady=5)
save_button = tk.Button(settings_window, text="Save", command=save_settings)
save_button.grid(row=2, column=1, padx=5, pady=5, sticky="e")
settings_window.mainloop()
def show_settings_window(settings):
print("show_settings_window called")
if not settings['settings_open']:
settings['settings_open'] = True
create_settings_window(settings)
The final product code and installation instructions
If you want to try this code, REMEMBER TO RUN IT IN ANACONDA WITH PYTHON 3.9. ANYTHING ELSE WILL NOT WORK, because numba (which Whisper depends on) is still very much under development.
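For example, a dedicated environment can be created and activated like this (the environment name alics is just a placeholder):
conda create -n alics python=3.9
conda activate alics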
NECESSARY PACKAGES TO INSTALL TO RUN THE PROJECT
pip install -U openai-whisper
pip install SpeechRecognition
pip install elevenlabs
pip install simpleaudio
pip install PyAudio  # please install PyAudio before pydub
pip install pydub
pip install opencv-python
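The scripts below also import openai and Pillow (PIL), so if they are not already present in your environment you will most likely need these as well:
pip install openai
pip install Pillow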
ALICS with the Bella voice
import cv2
from tkinter import *
import tkinter as tk
from PIL import Image, ImageTk
import threading
import openai
import asyncio
import whisper
import speech_recognition as sr
import simpleaudio as sa
from pydub import AudioSegment
import warnings
from numba import NumbaDeprecationWarning
import queue
from elevenlabs import generate, play
warnings.filterwarnings("ignore", category=NumbaDeprecationWarning)
openai.api_key = "[YOUR API KEY]"
recognizer = sr.Recognizer()
recognizer.energy_threshold = 300
GPT_WAKE_WORD = "hi alex"
GPT_SLEEP_WORD = "goodbye"
is_audio_playing = False
video_file1 = "/Users/giaogiaoguo/Desktop/interface/Alics_vid/Defult2.MOV"
video_file2 = "/Users/giaogiaoguo/Desktop/interface/Alics_vid/SPeak.MOV"
video_capture1 = cv2.VideoCapture(video_file1)
video_capture2 = cv2.VideoCapture(video_file2)
def create_settings_window(settings):
settings_window = Tk()
settings_window.title("ALICS Settings")
def save_settings():
settings['system_message'] = system_message_entry.get()
settings['max_tokens'] = int(max_tokens_entry.get())
settings_window.destroy()
settings['settings_open'] = False
system_message_label = tk.Label(settings_window, text="System message:")
system_message_label.grid(row=0, column=0, sticky="e", padx=5, pady=5)
system_message_entry = tk.Entry(settings_window)
system_message_entry.insert(0, settings['system_message'])
system_message_entry.grid(row=0, column=1, padx=5, pady=5)
max_tokens_label = tk.Label(settings_window, text="Max tokens:")
max_tokens_label.grid(row=1, column=0, sticky="e", padx=5, pady=5)
max_tokens_entry = tk.Entry(settings_window)
max_tokens_entry.insert(0, settings['max_tokens'])
max_tokens_entry.grid(row=1, column=1, padx=5, pady=5)
save_button = tk.Button(settings_window, text="Save", command=save_settings)
save_button.grid(row=2, column=1, padx=5, pady=5, sticky="e")
settings_window.mainloop()
def show_settings_window(settings):
print("show_settings_window called")
if not settings['settings_open']:
settings['settings_open'] = True
create_settings_window(settings)
def get_wake_word(phrase):
if GPT_WAKE_WORD in phrase.lower():
return GPT_WAKE_WORD
else:
return None
def get_sleep_word(phrase):
if GPT_SLEEP_WORD in phrase.lower():
return GPT_SLEEP_WORD
else:
return None
API_KEY = "[YOUR API KEY]"  # ElevenLabs API key
VOICE_ID = "[YOUR VOICE ID]"
def synthesize_speech(text, output_filename):
audio = generate(text, voice=VOICE_ID, api_key=API_KEY)
with open(output_filename, 'wb') as f:
f.write(audio)
def transcribe_audio_with_whisper(audio_file_path):
model = whisper.load_model("base")
result = model.transcribe(audio_file_path)
return result["text"].strip()
def play_audio(file):
global is_audio_playing
is_audio_playing = True
sound = AudioSegment.from_mp3(file)  # expecting an mp3 file
audio_data = sound.export(format="wav")
audio_data = audio_data.read()
audio_data = audio_data[44:]
audio_wave = sa.WaveObject(audio_data, sound.channels, sound.sample_width, sound.frame_rate)
play_obj = audio_wave.play()
play_obj.wait_done()
is_audio_playing = False
def process_user_input(user_input, response_queue):
bot_response = "You said: " + user_input
response_queue.put(bot_response)
async def main_with_gui(response_queue, text_var, settings):
wake_word_detected = False
greeting_played = False
while True:
with sr.Microphone() as source:
recognizer.adjust_for_ambient_noise(source)
if not wake_word_detected:
print(f"Say {GPT_WAKE_WORD} to start a conversation...")
while True:
audio = recognizer.listen(source)
audio_file = "audio.wav"
with open(audio_file, "wb") as f:
f.write(audio.get_wav_data())
phrase = transcribe_audio_with_whisper(audio_file)
print(f"Phrase: {phrase}")
if get_wake_word(phrase) is not None:
wake_word_detected = True
break
else:
print("Not a wake word. Try again.")
if not greeting_played:
play_audio('greetings.mp3')
greeting_played = True
while wake_word_detected:
print("Speak a prompt...")
audio = recognizer.listen(source)
audio_file = "audio_prompt.wav"
with open(audio_file, "wb") as f:
f.write(audio.get_wav_data())
user_input = transcribe_audio_with_whisper(audio_file)
print(f"User input: {user_input}")
if get_sleep_word(user_input) is not None:
wake_word_detected = False
greeting_played = False
print("Sleep word detected, going back to listening for wake word.")
play_audio('sleep.mp3')
text_var.set("")
break
response = openai.ChatCompletion.create(
model="gpt-3.5-turbo",
messages=[
{"role": "system", "content": settings['system_message']},
{"role": "user", "content": user_input},
],
temperature=0.6,
max_tokens=settings['max_tokens'],
top_p=1,
frequency_penalty=0,
presence_penalty=0,
n=1,
stop=["\nUser:"],
)
bot_response = response["choices"][0]["message"]["content"]
response_queue.put(bot_response)
synthesize_speech(bot_response, 'response.mp3')
play_audio('response.mp3')
def create_gui():
root = Tk()
root.title("ALICS (Artificial Intelligence for Life Improvement and Counseling Support)")
menu_bar= tk.Menu(root)
root.config(menu=menu_bar)
settings_menu = tk.Menu(menu_bar, tearoff=0)
menu_bar.add_cascade(label="Settings", menu=settings_menu)
settings_menu.add_command(label="Open Settings", command=lambda: show_settings_window(settings))
settings = {
'system_message': "Your name is Alice, an acronym for Artistic Intelligence for Life Improvement and Counseling Support. As a creative and insightful personal therapist, your mission is to help clients address their problems with touch of humor. Make an effort to connect with clients on a personal level by sharing relevant anecdotes or insightful metaphors when appropriate.",
'max_tokens': 150,
'settings_open': False
}
screen_width = root.winfo_screenwidth()
screen_height = root.winfo_screenheight()
root.geometry(f"{screen_width}x{screen_height}")
root.configure(bg="black")
system_message_var = StringVar()
system_message_var.set(settings['system_message'])
max_tokens_var = IntVar()
max_tokens_var.set(150)
video_frame = Frame(root, bg="black")
video_frame.pack(side="top", pady=(10, 0), anchor="center", expand=True)
video_file1 = "/Users/giaogiaoguo/Desktop/interface/Alics_vid/Defult2.MOV"
video_file2 = "/Users/giaogiaoguo/Desktop/interface/Alics_vid/SPeak.MOV"
video_capture1 = cv2.VideoCapture(video_file1)
video_capture2 = cv2.VideoCapture(video_file2)
second_video_frame_rate = 60
video_capture2.set(cv2.CAP_PROP_FPS, second_video_frame_rate)
def update_video_label():
global is_audio_playing, video_capture1, video_capture2
if is_audio_playing:
video_capture = video_capture2
scale_percent = 30
frame_rate=100
else:
video_capture = video_capture1
scale_percent = 30
frame_rate=100
ret, frame = video_capture.read()
if not ret:
video_capture.set(cv2.CAP_PROP_POS_FRAMES, 0)
ret, frame = video_capture.read()
cv2image = cv2.cvtColor(frame, cv2.COLOR_BGR2RGBA)
width = int(cv2image.shape[1] * scale_percent / 100)
height = int(cv2image.shape[0] * scale_percent / 100)
dim = (width, height)
resized = cv2.resize(cv2image, dim, interpolation=cv2.INTER_AREA)
img = Image.fromarray(resized)
imgtk = ImageTk.PhotoImage(image=img)
label.config(image=imgtk)
label.imgtk = imgtk
delay= int(1000 / frame_rate)
root.after(delay, update_video_label)
label = Label(video_frame, bg="black")
label.pack(side="top", anchor="center")
update_video_label()
text_frame = Frame(root, bg="black")
text_frame.pack(side="top", pady=(50, 20), anchor="center", expand=True)
text_var = StringVar()
text_widget = tk.Label(text_frame, textvariable=text_var, wraplength=1000, bg="black", fg="white", font=("Nanum Gothic", 12))
text_widget.pack(expand=True, fill=BOTH, anchor="center")
response_queue = queue.Queue()
threading.Thread(target=lambda: asyncio.run(main_with_gui(response_queue, text_var, settings)), daemon=True).start()
threading.Thread(target=update_text_widget, args=(text_widget, text_var, response_queue, root), daemon=True).start()
root.mainloop()
def update_text_widget(text_widget, text_var, response_queue, root):
while True:
response = response_queue.get()
text_var.set(f"ALICS: {response}\n")
if __name__ == "__main__":
create_gui()
ALICS with the cloned voice
import cv2
from tkinter import *
import tkinter as tk
from PIL import Image, ImageTk
import threading
import openai
import asyncio
import whisper
import speech_recognition as sr
import simpleaudio as sa
from pydub import AudioSegment
import queue
import elevenlabs
from elevenlabs import generate, play
from elevenlabs import clone, generate as generate_voice
from elevenlabs import set_api_key
set_api_key("[YOUR API KEY]")  # ElevenLabs API key
openai.api_key = "[YOUR API KEY]"
recognizer = sr.Recognizer()
recognizer.energy_threshold = 300
GPT_WAKE_WORD = "hi alex"
GPT_SLEEP_WORD = "goodbye"
is_audio_playing = False
video_file1 = "/Users/giaogiaoguo/Desktop/interface/Alics_vid/Defult2.MOV"
video_file2 = "/Users/giaogiaoguo/Desktop/interface/Alics_vid/SPeak.MOV"
video_capture1 = cv2.VideoCapture(video_file1)
video_capture2 = cv2.VideoCapture(video_file2)
def create_settings_window(settings):
settings_window = Tk()
settings_window.title("ALICS Settings")
def save_settings():
settings['system_message'] = system_message_entry.get()
settings['max_tokens'] = int(max_tokens_entry.get())
settings_window.destroy()
settings['settings_open'] = False
system_message_label = tk.Label(settings_window, text="System message:")
system_message_label.grid(row=0, column=0, sticky="e", padx=5, pady=5)
system_message_entry = tk.Entry(settings_window)
system_message_entry.insert(0, settings['system_message'])
system_message_entry.grid(row=0, column=1, padx=5, pady=5)
max_tokens_label = tk.Label(settings_window, text="Max tokens:")
max_tokens_label.grid(row=1, column=0, sticky="e", padx=5, pady=5)
max_tokens_entry = tk.Entry(settings_window)
max_tokens_entry.insert(0, settings['max_tokens'])
max_tokens_entry.grid(row=1, column=1, padx=5, pady=5)
save_button = tk.Button(settings_window, text="Save", command=save_settings)
save_button.grid(row=2, column=1, padx=5, pady=5, sticky="e")
settings_window.mainloop()
def show_settings_window(settings):
print("show_settings_window called")
if not settings['settings_open']:
settings['settings_open'] = True
create_settings_window(settings)
def get_wake_word(phrase):
if GPT_WAKE_WORD in phrase.lower():
return GPT_WAKE_WORD
else:
return None
def get_sleep_word(phrase):
if GPT_SLEEP_WORD in phrase.lower():
return GPT_SLEEP_WORD
else:
return None
def clone_voice():
voice = clone(
name="yuris",
description="",
files=[""],
)
return voice
voice = clone_voice()
def synthesize_speech(text, output_filename):
audio = generate_voice(text=text, voice=voice)
with open(output_filename, 'wb') as f:
f.write(audio)
def transcribe_audio_with_whisper(audio_file_path):
model = whisper.load_model("base")
result = model.transcribe(audio_file_path)
return result["text"].strip()
def play_audio(file):
global is_audio_playing
is_audio_playing = True
sound = AudioSegment.from_mp3(file)
audio_data = sound.export(format="wav")
audio_data = audio_data.read()
audio_data = audio_data[44:]
audio_wave = sa.WaveObject(audio_data, sound.channels, sound.sample_width, sound.frame_rate)
play_obj = audio_wave.play()
play_obj.wait_done()
is_audio_playing = False
def process_user_input(user_input, response_queue):
bot_response = "You said: " + user_input
response_queue.put(bot_response)
async def main_with_gui(response_queue, text_var, settings):
wake_word_detected = False
greeting_played = False
while True:
with sr.Microphone() as source:
recognizer.adjust_for_ambient_noise(source)
if not wake_word_detected:
print(f"Say {GPT_WAKE_WORD} to start a conversation...")
while True:
audio = recognizer.listen(source)
audio_file = "audio.wav"
with open(audio_file, "wb") as f:
f.write(audio.get_wav_data())
phrase = transcribe_audio_with_whisper(audio_file)
print(f"Phrase: {phrase}")
if get_wake_word(phrase) is not None:
wake_word_detected = True
break
else:
print("Not a wake word. Try again.")
if not greeting_played:
play_audio('greetings.mp3')
greeting_played = True
while wake_word_detected:
print("Speak a prompt...")
audio = recognizer.listen(source)
audio_file = "audio_prompt.wav"
with open(audio_file, "wb") as f:
f.write(audio.get_wav_data())
user_input = transcribe_audio_with_whisper(audio_file)
print(f"User input: {user_input}")
if get_sleep_word(user_input) is not None:
wake_word_detected = False
greeting_played = False
print("Sleep word detected, going back to listening for wake word.")
play_audio('sleep.mp3')
text_var.set("")
break
response = openai.ChatCompletion.create(
model="gpt-3.5-turbo",
messages=[
{"role": "system", "content": settings['system_message']},
{"role": "user", "content": user_input},
],
temperature=0.6,
max_tokens=settings['max_tokens'],
top_p=1,
frequency_penalty=0,
presence_penalty=0,
n=1,
stop=["\nUser:"],
)
bot_response = response["choices"][0]["message"]["content"]
response_queue.put(bot_response)
synthesize_speech(bot_response, 'response.mp3')
play_audio('response.mp3')
def create_gui():
root = Tk()
root.title("ALICS (Artificial Intelligence for Life Improvement and Counseling Support)")
menu_bar= tk.Menu(root)
root.config(menu=menu_bar)
settings_menu = tk.Menu(menu_bar, tearoff=0)
menu_bar.add_cascade(label="Settings", menu=settings_menu)
settings_menu.add_command(label="Open Settings", command=lambda: show_settings_window(settings))
settings = {
'system_message': "Your name is Yuris, You are a 19 years old college student who speaks with a sarcastic tone. You are talking to a friend who is trying to imporve her life",
'max_tokens': 150,
'settings_open': False
}
screen_width = root.winfo_screenwidth()
screen_height = root.winfo_screenheight()
root.geometry(f"{screen_width}x{screen_height}")
root.configure(bg="black")
system_message_var = StringVar()
system_message_var.set(settings['system_message'])
max_tokens_var = IntVar()
max_tokens_var.set(150)
video_frame = Frame(root, bg="black")
video_frame.pack(side="top", pady=(10, 0), anchor="center", expand=True)
video_file1 = "/Users/giaogiaoguo/Desktop/interface/Alics_vid/Defult2.MOV"
video_file2 = "/Users/giaogiaoguo/Desktop/interface/Alics_vid/SPeak.MOV"
video_capture1 = cv2.VideoCapture(video_file1)
video_capture2 = cv2.VideoCapture(video_file2)
second_video_frame_rate = 60
video_capture2.set(cv2.CAP_PROP_FPS, second_video_frame_rate)
def update_video_label():
global is_audio_playing, video_capture1, video_capture2
if is_audio_playing:
video_capture = video_capture2
scale_percent = 30
frame_rate=100
else:
video_capture = video_capture1
scale_percent = 30
frame_rate=100
ret, frame = video_capture.read()
if not ret:
video_capture.set(cv2.CAP_PROP_POS_FRAMES, 0)
ret, frame = video_capture.read()
cv2image = cv2.cvtColor(frame, cv2.COLOR_BGR2RGBA)
width = int(cv2image.shape[1] * scale_percent / 100)
height = int(cv2image.shape[0] * scale_percent / 100)
dim = (width, height)
resized = cv2.resize(cv2image, dim, interpolation=cv2.INTER_AREA)
img = Image.fromarray(resized)
imgtk = ImageTk.PhotoImage(image=img)
label.config(image=imgtk)
label.imgtk = imgtk
delay= int(1000 / frame_rate)
root.after(delay, update_video_label)
label = Label(video_frame, bg="black")
label.pack(side="top", anchor="center")
update_video_label()
text_frame = Frame(root, bg="black")
text_frame.pack(side="top", pady=(50, 20), anchor="center", expand=True)
text_var = StringVar()
text_widget = tk.Label(text_frame, textvariable=text_var, wraplength=1000, bg="black", fg="white", font=("Nanum Gothic", 12))
text_widget.pack(expand=True, fill=BOTH, anchor="center")
response_queue = queue.Queue()
threading.Thread(target=lambda: asyncio.run(main_with_gui(response_queue, text_var, settings)), daemon=True).start()
threading.Thread(target=update_text_widget, args=(text_widget, text_var, response_queue, root), daemon=True).start()
root.mainloop()
def update_text_widget(text_widget, text_var, response_queue, root):
while True:
response = response_queue.get()
text_var.set(f"YOUR_NAME: {response}\n")
if __name__ == "__main__":
create_gui()
Other parts - presentation, mock-up, woodwork
Presentation

User (or just me) test video
Mock-up

