ALICS
(Artificial Intelligence for Life Improvement and Counseling Support)
ALICS Documentation
Magic Mirror
At the very first stage of the project I planned to build everything from modules on the open-source MagicMirror platform, but because Node.js packages are constantly updating, many of those modules are poorly maintained.
During this process, however, I got to know the ElevenLabs API through the MMM-11-TTS module made by someone else, and it inspired me to go ahead with the hyper-realistic voice-over.
These are the links to the modules I used.
Python script (YouTube video)
I came across a YouTube video of someone building a Bing AI voice assistant. In his script he used Amazon Polly for the voice-over and had no user interface, since he ran the program purely in the terminal. So I went ahead and, more or less, stole his code.
Code in video
Here is the code from the video; I referenced its wake-word function and transcribe function.
import openai
import asyncio
import re
import whisper
import boto3
import pydub
from pydub import playback
import speech_recognition as sr
from EdgeGPT import Chatbot, ConversationStyle
# Initialize the OpenAI API
openai.api_key = "[paste your OpenAI API key here]"
# Create a recognizer object and wake word variables
recognizer = sr.Recognizer()
BING_WAKE_WORD = "bing"
GPT_WAKE_WORD = "gpt"
def get_wake_word(phrase):
if BING_WAKE_WORD in phrase.lower():
return BING_WAKE_WORD
elif GPT_WAKE_WORD in phrase.lower():
return GPT_WAKE_WORD
else:
return None
def synthesize_speech(text, output_filename):
polly = boto3.client('polly', region_name='us-west-2')
response = polly.synthesize_speech(
Text=text,
OutputFormat='mp3',
VoiceId='Salli',
Engine='neural'
)
with open(output_filename, 'wb') as f:
f.write(response['AudioStream'].read())
def play_audio(file):
sound = pydub.AudioSegment.from_file(file, format="mp3")
playback.play(sound)
async def main():
while True:
with sr.Microphone() as source:
recognizer.adjust_for_ambient_noise(source)
print(f"Waiting for wake words 'ok bing' or 'ok chat'...")
while True:
audio = recognizer.listen(source)
try:
with open("audio.wav", "wb") as f:
f.write(audio.get_wav_data())
# Load the Whisper tiny model and transcribe the captured audio
model = whisper.load_model("tiny")
result = model.transcribe("audio.wav")
phrase = result["text"]
print(f"You said: {phrase}")
wake_word = get_wake_word(phrase)
if wake_word is not None:
break
else:
print("Not a wake word. Try again.")
except Exception as e:
print("Error transcribing audio: {0}".format(e))
continue
print("Speak a prompt...")
synthesize_speech('What can I help you with?', 'response.mp3')
play_audio('response.mp3')
audio = recognizer.listen(source)
try:
with open("audio_prompt.wav", "wb") as f:
f.write(audio.get_wav_data())
model = whisper.load_model("base")
result = model.transcribe("audio_prompt.wav")
user_input = result["text"]
print(f"You said: {user_input}")
except Exception as e:
print("Error transcribing audio: {0}".format(e))
continue
if wake_word == BING_WAKE_WORD:
bot = Chatbot(cookie_path='cookies.json')
response = await bot.ask(prompt=user_input, conversation_style=ConversationStyle.precise)
# Select only the bot response from the response dictionary
for message in response["item"]["messages"]:
if message["author"] == "bot":
bot_response = message["text"]
# Remove [^#^] citations in response
bot_response = re.sub(r'\[\^\d+\^\]', '', bot_response)
else:
# Send prompt to GPT-3.5-turbo API
response = openai.ChatCompletion.create(
model="gpt-3.5-turbo",
messages=[
{"role": "system", "content":
"You are a helpful assistant."},
{"role": "user", "content": user_input},
],
temperature=0.5,
max_tokens=150,
top_p=1,
frequency_penalty=0,
presence_penalty=0,
n=1,
stop=["\nUser:"],
)
bot_response = response["choices"][0]["message"]["content"]
print("Bot's response:", bot_response)
synthesize_speech(bot_response, 'response.mp3')
play_audio('response.mp3')
await bot.close()
if __name__ == "__main__":
asyncio.run(main())
The actual build
I didn't comment the code as I was writing the script, and since the script ended up around 500 lines long, it would be too much to go back and comment every line. So I will just type out an explanation for each snippet.
Speech synthesis and playback
Replacing Amazon Polly
However, my intention was to substitute Amazon Polly with the ElevenLabs API, and that's where the problems came in. In the video he simply used the pydub sound library together with boto3 for the voice interaction, because Amazon Polly could hand pydub a WAV file to play back in time. ElevenLabs, however, could only deliver MP3 files, which made the whole process even slower.
Audio playback speed
I asked GPT what to do about the slow playback, and it suggested switching to another library, simpleaudio, converting the MP3 to WAV, and then playing it back to the user.
After I switched to simpleaudio there was a small clicking noise at the start of each file, so I skipped the first 44 bytes of the exported audio before playback; those 44 bytes are the standard WAV header, so dropping them leaves only the raw PCM samples and removes the click.
The speech_recognition library is used to capture and transcribe the user's voice input. A recognizer object is created to recognize speech from the microphone, and its energy_threshold is set to 300 to control how loud the input has to be before the recognizer treats it as speech rather than background noise.
The simpleaudio and pydub libraries are used to play audio files. The play_audio function loads an MP3 file, converts it to WAV format, and plays it. The function also sets the is_audio_playing variable to True so the audio status can be checked for video-update purposes.
The elevenlabs library is used to synthesize speech through the ElevenLabs API. The synthesize_speech function takes the text to be synthesized and an output filename, generates the audio using the generate function, and writes it to disk.
The transcribe_audio_with_whisper function uses the whisper library to transcribe an audio file with a Whisper model. The function loads a model of the chosen size and returns the transcribed text.
The process_user_input function takes the user's input, prefixes it with the string "You said: ", and puts the resulting string in the response_queue to be displayed later in the GUI.
Here is that part of the code
import speech_recognition as sr
import simpleaudio as sa
from pydub import AudioSegment
import warnings
from numba import NumbaDeprecationWarning
import queue
from elevenlabs import generate, play
warnings.filterwarnings("ignore", category=NumbaDeprecationWarning)
openai.api_key = "[YOUR API KEY]"
API_KEY = "[YOUR API KEY]"  # ElevenLabs API key
VOICE_ID = "[YOUR VOICE ID]"  # voice ID of your choice
recognizer = sr.Recognizer()
recognizer.energy_threshold = 300  # minimum audio energy before the recognizer treats input as speech
GPT_WAKE_WORD = "hi alex"
GPT_SLEEP_WORD = "goodbye"
def synthesize_speech(text, output_filename):
audio = generate(text, voice=VOICE_ID, api_key=API_KEY)
with open(output_filename, 'wb') as f:
f.write(audio)
def transcribe_audio_with_whisper(audio_file_path):
model = whisper.load_model("base") #choose the size of your whisper modle
result = model.transcribe(audio_file_path)
return result["text"].strip()
def play_audio(file):
global is_audio_playing
is_audio_playing = True  # track audio status for video-update purposes
sound = AudioSegment.from_mp3(file)  # expecting an mp3 file
audio_data = sound.export(format="wav")  # convert to wav in memory
audio_data = audio_data.read()
audio_data = audio_data[44:]  # skip the 44-byte WAV header to avoid a click at the start of playback
audio_wave = sa.WaveObject(audio_data, sound.channels, sound.sample_width, sound.frame_rate)
play_obj = audio_wave.play()
play_obj.wait_done()
is_audio_playing = False
def process_user_input(user_input, response_queue):
bot_response = "You said: " + user_input
response_queue.put(bot_response)
Building the Sleep Command
Because the video tutorial already had a small wake-word function, I decided to build a sleep command as well: whenever the sleep word is detected, the bot goes back to listening for the wake word. In the ideal setup, the user could simply leave the mirror hanging on the wall.
The assistant listens for a wake word, which is set to the value of GPT_WAKE_WORD. When it detects the wake word, it starts listening for user input and transcribes it to text using the speech recognition library.
The user input is then passed to the GPT-3.5 API using OpenAI's openai.ChatCompletion.create() method to generate a response. The response is synthesized into an audio file by the text-to-speech library and played back to the user.
The assistant also listens for a sleep word, which is set to the value of GPT_SLEEP_WORD, to stop listening for user input and go back to listening for the wake word. During this transition, the assistant plays an audio file indicating that it is going to sleep.
The main function main_with_gui() uses a GUI to display the assistant's responses and settings. response_queue is a queue that holds the bot's responses, and text_var is a tkinter variable used to update the GUI display. The settings parameter is a dictionary containing the settings for the GPT API, such as the maximum number of tokens and the system message sent to the model.
Here is the code
def get_wake_word(phrase):
if GPT_WAKE_WORD in phrase.lower():
return GPT_WAKE_WORD
else:
return None
def get_sleep_word(phrase):
if GPT_SLEEP_WORD in phrase.lower():
return GPT_SLEEP_WORD
else:
return None
async def main_with_gui(response_queue, text_var, settings):
wake_word_detected = False
greeting_played = False
while True:
with sr.Microphone() as source:
recognizer.adjust_for_ambient_noise(source)
if not wake_word_detected:
print(f"Say {GPT_WAKE_WORD} to start a conversation...")
while True:
audio = recognizer.listen(source)
audio_file = "audio.wav"
with open(audio_file, "wb") as f:
f.write(audio.get_wav_data())
phrase = transcribe_audio_with_whisper(audio_file)
print(f"Phrase: {phrase}")
if get_wake_word(phrase) is not None:
wake_word_detected = True
break
else:
print("Not a wake word. Try again.")
if not greeting_played:
play_audio('greetings.mp3')
greeting_played = True
while wake_word_detected:
print("Speak a prompt...")
audio = recognizer.listen(source)
audio_file = "audio_prompt.wav"
with open(audio_file, "wb") as f:
f.write(audio.get_wav_data())
user_input = transcribe_audio_with_whisper(audio_file)
print(f"User input: {user_input}")
if get_sleep_word(user_input) is not None:
wake_word_detected = False
greeting_played = False
print("Sleep word detected, going back to listening for wake word.")
play_audio('sleep.mp3')
text_var.set("")
break
response = openai.ChatCompletion.create(
model="gpt-3.5-turbo",
messages=[
{"role": "system", "content": settings['system_message']},
{"role": "user", "content": user_input},
],
temperature=0.6,
max_tokens=settings['max_tokens'],
top_p=1,
frequency_penalty=0,
presence_penalty=0,
n=1,
stop=["\nUser:"],
)
bot_response = response["choices"][0]["message"]["content"]
response_queue.put(bot_response)
synthesize_speech(bot_response, 'response.mp3')
play_audio('response.mp3')
Building the User Interface
I was not sure what to use for the user interface, so I just googled around and went with tkinter.
In the process, I swapped out the Text widget and used a Label for text display, as it looked more modern.
The create_gui() function creates the main window of the interface and sets its title, menu bar, and dimensions. It also initializes a dictionary called settings with default values that control the behavior of the program, such as the maximum number of tokens to use when generating responses and the system message that defines the assistant's persona.
The screen_width and screen_height variables are used to size the window so that it fills the entire screen.
The menu_bar object is created using the Menu widget, and a settings_menu is created as a dropdown on the menu bar. The settings_menu contains one option, "Open Settings", which opens a settings window when clicked.
The video_frame is created to hold the video stream, and the text_frame is created to display the program's responses to the user.
The update_video_label() function is responsible for updating the video stream shown in the GUI. It reads frames from one of two video files (one showing an idle animation, the other showing the AI talking), resizes them, and displays them in the GUI using Tkinter's Label widget.
The text_var and text_widget variables are used to display the program's responses to the user in the text_frame created earlier. The text_widget is created using the Label widget from Tkinter.
Finally, a response_queue is created to hold the program's responses, and two threads are started: one to run the main program logic, and another to update the text_widget with the program's responses.
Here is the code
def create_gui():
root = Tk()
root.title("ALICS (Artificial Intelligence for Life Improvement and Counseling Support)")
menu_bar= tk.Menu(root)
root.config(menu=menu_bar)
settings_menu = tk.Menu(menu_bar, tearoff=0)
menu_bar.add_cascade(label="Settings", menu=settings_menu)
settings_menu.add_command(label="Open Settings", command=lambda: show_settings_window(settings))
settings = {
'system_message': "Your name is Alice, an acronym for Artistic Intelligence for Life Improvement and Counseling Support. As a creative and insightful personal therapist, your mission is to help clients address their problems with touch of humor. Make an effort to connect with clients on a personal level by sharing relevant anecdotes or insightful metaphors when appropriate.",
'max_tokens': 150,
'settings_open': False
}
screen_width = root.winfo_screenwidth()
screen_height = root.winfo_screenheight()
root.geometry(f"{screen_width}x{screen_height}")
root.configure(bg="black")
system_message_var = StringVar()
system_message_var.set(settings['system_message'])
max_tokens_var = IntVar()
max_tokens_var.set(150)
video_frame = Frame(root, bg="black")
video_frame.pack(side="top", pady=(10, 0), anchor="center", expand=True)
video_file1 = "/Users/giaogiaoguo/Desktop/interface/Alics_vid/Defult2.MOV"
video_file2 = "/Users/giaogiaoguo/Desktop/interface/Alics_vid/SPeak.MOV"
video_capture1 = cv2.VideoCapture(video_file1)
video_capture2 = cv2.VideoCapture(video_file2)
second_video_frame_rate = 60
video_capture2.set(cv2.CAP_PROP_FPS, second_video_frame_rate)
def update_video_label():
global is_audio_playing, video_capture1, video_capture2
if is_audio_playing:
video_capture = video_capture2
scale_percent = 30
frame_rate=100
else:
video_capture = video_capture1
scale_percent = 30
frame_rate=100
ret, frame = video_capture.read()
if not ret:
video_capture.set(cv2.CAP_PROP_POS_FRAMES, 0)
ret, frame = video_capture.read()
cv2image = cv2.cvtColor(frame, cv2.COLOR_BGR2RGBA)
width = int(cv2image.shape[1] * scale_percent / 100)
height = int(cv2image.shape[0] * scale_percent / 100)
dim = (width, height)
resized = cv2.resize(cv2image, dim, interpolation=cv2.INTER_AREA)
img = Image.fromarray(resized)
imgtk = ImageTk.PhotoImage(image=img)
label.config(image=imgtk)
label.imgtk = imgtk
delay= int(1000 / frame_rate)
root.after(delay, update_video_label)
label = Label(video_frame, bg="black")  # using a Label instead of a Text widget
label.pack(side="top", anchor="center")
update_video_label()
text_frame = Frame(root, bg="black")
text_frame.pack(side="top", pady=(50, 20), anchor="center", expand=True)
text_var = StringVar()
text_widget = tk.Label(text_frame, textvariable=text_var, wraplength=1000, bg="black", fg="white", font=("Nanum Gothic", 12))
text_widget.pack(expand=True, fill=BOTH, anchor="center")
response_queue = queue.Queue()
threading.Thread(target=lambda: asyncio.run(main_with_gui(response_queue, text_var, settings)), daemon=True).start()
threading.Thread(target=update_text_widget, args=(text_widget, text_var, response_queue, root), daemon=True).start()
root.mainloop()
def update_text_widget(text_widget, text_var, response_queue, root):
while True:
response = response_queue.get()
text_var.set(f"ALICS: {response}\n")
Animation, Logo, and Clone Voice
This part of the documentation can be found here.

Here is the official ElevenLabs documentation for the Python library; the use of the clone function is described there.
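For reference, here is a minimal sketch of how the clone function from the elevenlabs library can be used (the voice name and sample file names below are placeholders; this mirrors the clone_voice and synthesize_speech functions in the final script further down):
from elevenlabs import clone, generate, set_api_key

set_api_key("[YOUR API KEY]")  # ElevenLabs API key

# clone() creates a new voice from one or more reference recordings
voice = clone(
    name="my-cloned-voice",                  # placeholder voice name
    description="a short description of the voice",
    files=["sample1.mp3", "sample2.mp3"],    # placeholder sample recordings
)

# generate() then synthesizes speech with the cloned voice and returns raw audio bytes
audio = generate(text="Hello, this is my cloned voice.", voice=voice)
with open("cloned_test.mp3", "wb") as f:
    f.write(audio)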
Settings Function
I have singled out the "max tokens" and "system message" settings so they can be adjusted from the GUI.
This code defines two functions related to the settings window of the application:
create_settings_window(settings): This function creates a new window using Tk() and sets its title to "ALICS Settings". It then defines a function called save_settings(), which updates the system_message and max_tokens values in the settings dictionary with the values entered in the respective Entry widgets; after updating the values, the window is destroyed and the settings_open flag is set back to False. The function then creates two Label widgets, one for the system message and one for max tokens, plus an Entry widget for each so the user can enter their own values. The initial values of the Entry widgets are taken from the current values stored in the settings dictionary. A Save button is also created, which calls save_settings() when clicked.
show_settings_window(settings): This function checks whether the settings window is already open by looking at the settings_open flag in the settings dictionary. If the window is not open, it sets the flag to True and calls create_settings_window() to create and display the settings window. If the window is already open, nothing further happens (the function only prints a short debug message every time it is called).
def create_settings_window(settings):
settings_window = Tk()
settings_window.title("ALICS Settings")
def save_settings():
settings['system_message'] = system_message_entry.get()
settings['max_tokens'] = int(max_tokens_entry.get())
settings_window.destroy()
settings['settings_open'] = False
system_message_label = tk.Label(settings_window, text="System message:")
system_message_label.grid(row=0, column=0, sticky="e", padx=5, pady=5)
system_message_entry = tk.Entry(settings_window)
system_message_entry.insert(0, settings['system_message'])
system_message_entry.grid(row=0, column=1, padx=5, pady=5)
max_tokens_label = tk.Label(settings_window, text="Max tokens:")
max_tokens_label.grid(row=1, column=0, sticky="e", padx=5, pady=5)
max_tokens_entry = tk.Entry(settings_window)
max_tokens_entry.insert(0, settings['max_tokens'])
max_tokens_entry.grid(row=1, column=1, padx=5, pady=5)
save_button = tk.Button(settings_window, text="Save", command=save_settings)
save_button.grid(row=2, column=1, padx=5, pady=5, sticky="e")
settings_window.mainloop()
def show_settings_window(settings):
print("show_settings_window called")
if not settings['settings_open']:
settings['settings_open'] = True
create_settings_window(settings)
The final product code and installation instructions
If you want to try this code, REMEMBER TO RUN IT IN ANACONDA WITH PYTHON 3.9. ANYTHING ELSE WILL NOT WORK, because numba (which Whisper depends on) is still very much under development.
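For example, a dedicated environment can be created and activated like this (the environment name alics is just a placeholder):
conda create -n alics python=3.9
conda activate alics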
NECESSARY PACKAGES TO INSTALL TO RUN THE PROJECT
pip install -U openai-whisper
pip install SpeechRecognition
pip install elevenlabs
pip install simpleaudio
pip install PyAudio  # please install PyAudio before pydub
pip install pydub
pip install opencv-python
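The scripts below also import openai and Pillow (PIL), so if they are not already present in your environment you will most likely need these as well:
pip install openai
pip install Pillow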
ALICS with the Bella voice
import cv2
from tkinter import *
import tkinter as tk
from PIL import Image, ImageTk
import threading
import openai
import asyncio
import whisper
import speech_recognition as sr
import simpleaudio as sa
from pydub import AudioSegment
import warnings
from numba import NumbaDeprecationWarning
import queue
from elevenlabs import generate, play
warnings.filterwarnings("ignore", category=NumbaDeprecationWarning)
openai.api_key = "[YOUR API KEY]"
recognizer = sr.Recognizer()
recognizer.energy_threshold = 300
GPT_WAKE_WORD = "hi alex"
GPT_SLEEP_WORD = "goodbye"
is_audio_playing = False
video_file1 = "/Users/giaogiaoguo/Desktop/interface/Alics_vid/Defult2.MOV"
video_file2 = "/Users/giaogiaoguo/Desktop/interface/Alics_vid/SPeak.MOV"
video_capture1 = cv2.VideoCapture(video_file1)
video_capture2 = cv2.VideoCapture(video_file2)
def create_settings_window(settings):
settings_window = Tk()
settings_window.title("ALICS Settings")
def save_settings():
settings['system_message'] = system_message_entry.get()
settings['max_tokens'] = int(max_tokens_entry.get())
settings_window.destroy()
settings['settings_open'] = False
system_message_label = tk.Label(settings_window, text="System message:")
system_message_label.grid(row=0, column=0, sticky="e", padx=5, pady=5)
system_message_entry = tk.Entry(settings_window)
system_message_entry.insert(0, settings['system_message'])
system_message_entry.grid(row=0, column=1, padx=5, pady=5)
max_tokens_label = tk.Label(settings_window, text="Max tokens:")
max_tokens_label.grid(row=1, column=0, sticky="e", padx=5, pady=5)
max_tokens_entry = tk.Entry(settings_window)
max_tokens_entry.insert(0, settings['max_tokens'])
max_tokens_entry.grid(row=1, column=1, padx=5, pady=5)
save_button = tk.Button(settings_window, text="Save", command=save_settings)
save_button.grid(row=2, column=1, padx=5, pady=5, sticky="e")
settings_window.mainloop()
def show_settings_window(settings):
print("show_settings_window called")
if not settings['settings_open']:
settings['settings_open'] = True
create_settings_window(settings)
def get_wake_word(phrase):
if GPT_WAKE_WORD in phrase.lower():
return GPT_WAKE_WORD
else:
return None
def get_sleep_word(phrase):
if GPT_SLEEP_WORD in phrase.lower():
return GPT_SLEEP_WORD
else:
return None
API_KEY = "[YOUR API KEY]"  # ElevenLabs API key
VOICE_ID = "[YOUR VOICE ID]"
def synthesize_speech(text, output_filename):
audio = generate(text, voice=VOICE_ID, api_key=API_KEY)
with open(output_filename, 'wb') as f:
f.write(audio)
def transcribe_audio_with_whisper(audio_file_path):
model = whisper.load_model("base")
result = model.transcribe(audio_file_path)
return result["text"].strip()
def play_audio(file):
global is_audio_playing
is_audio_playing = True
sound = AudioSegment.from_mp3(file)  # expecting an mp3 file
audio_data = sound.export(format="wav")
audio_data = audio_data.read()
audio_data = audio_data[44:]
audio_wave = sa.WaveObject(audio_data, sound.channels, sound.sample_width, sound.frame_rate)
play_obj = audio_wave.play()
play_obj.wait_done()
is_audio_playing = False
def process_user_input(user_input, response_queue):
bot_response = "You said: " + user_input
response_queue.put(bot_response)
async def main_with_gui(response_queue, text_var, settings):
wake_word_detected = False
greeting_played = False
while True:
with sr.Microphone() as source:
recognizer.adjust_for_ambient_noise(source)
if not wake_word_detected:
print(f"Say {GPT_WAKE_WORD} to start a conversation...")
while True:
audio = recognizer.listen(source)
audio_file = "audio.wav"
with open(audio_file, "wb") as f:
f.write(audio.get_wav_data())
phrase = transcribe_audio_with_whisper(audio_file)
print(f"Phrase: {phrase}")
if get_wake_word(phrase) is not None:
wake_word_detected = True
break
else:
print("Not a wake word. Try again.")
if not greeting_played:
play_audio('greetings.mp3')
greeting_played = True
while wake_word_detected:
print("Speak a prompt...")
audio = recognizer.listen(source)
audio_file = "audio_prompt.wav"
with open(audio_file, "wb") as f:
f.write(audio.get_wav_data())
user_input = transcribe_audio_with_whisper(audio_file)
print(f"User input: {user_input}")
if get_sleep_word(user_input) is not None:
wake_word_detected = False
greeting_played = False
print("Sleep word detected, going back to listening for wake word.")
play_audio('sleep.mp3')
text_var.set("")
break
response = openai.ChatCompletion.create(
model="gpt-3.5-turbo",
messages=[
{"role": "system", "content": settings['system_message']},
{"role": "user", "content": user_input},
],
temperature=0.6,
max_tokens=settings['max_tokens'],
top_p=1,
frequency_penalty=0,
presence_penalty=0,
n=1,
stop=["\nUser:"],
)
bot_response = response["choices"][0]["message"]["content"]
response_queue.put(bot_response)
synthesize_speech(bot_response, 'response.mp3')
play_audio('response.mp3')
def create_gui():
root = Tk()
root.title("ALICS (Artificial Intelligence for Life Improvement and Counseling Support)")
menu_bar= tk.Menu(root)
root.config(menu=menu_bar)
settings_menu = tk.Menu(menu_bar, tearoff=0)
menu_bar.add_cascade(label="Settings", menu=settings_menu)
settings_menu.add_command(label="Open Settings", command=lambda: show_settings_window(settings))
settings = {
'system_message': "Your name is Alice, an acronym for Artistic Intelligence for Life Improvement and Counseling Support. As a creative and insightful personal therapist, your mission is to help clients address their problems with touch of humor. Make an effort to connect with clients on a personal level by sharing relevant anecdotes or insightful metaphors when appropriate.",
'max_tokens': 150,
'settings_open': False
}
screen_width = root.winfo_screenwidth()
screen_height = root.winfo_screenheight()
root.geometry(f"{screen_width}x{screen_height}")
root.configure(bg="black")
system_message_var = StringVar()
system_message_var.set(settings['system_message'])
max_tokens_var = IntVar()
max_tokens_var.set(150)
video_frame = Frame(root, bg="black")
video_frame.pack(side="top", pady=(10, 0), anchor="center", expand=True)
video_file1 = "/Users/giaogiaoguo/Desktop/interface/Alics_vid/Defult2.MOV"
video_file2 = "/Users/giaogiaoguo/Desktop/interface/Alics_vid/SPeak.MOV"
video_capture1 = cv2.VideoCapture(video_file1)
video_capture2 = cv2.VideoCapture(video_file2)
second_video_frame_rate = 60
video_capture2.set(cv2.CAP_PROP_FPS, second_video_frame_rate)
def update_video_label():
global is_audio_playing, video_capture1, video_capture2
if is_audio_playing:
video_capture = video_capture2
scale_percent = 30
frame_rate=100
else:
video_capture = video_capture1
scale_percent = 30
frame_rate=100
ret, frame = video_capture.read()
if not ret:
video_capture.set(cv2.CAP_PROP_POS_FRAMES, 0)
ret, frame = video_capture.read()
cv2image = cv2.cvtColor(frame, cv2.COLOR_BGR2RGBA)
width = int(cv2image.shape[1] * scale_percent / 100)
height = int(cv2image.shape[0] * scale_percent / 100)
dim = (width, height)
resized = cv2.resize(cv2image, dim, interpolation=cv2.INTER_AREA)
img = Image.fromarray(resized)
imgtk = ImageTk.PhotoImage(image=img)
label.config(image=imgtk)
label.imgtk = imgtk
delay= int(1000 / frame_rate)
root.after(delay, update_video_label)
label = Label(video_frame, bg="black")
label.pack(side="top", anchor="center")
update_video_label()
text_frame = Frame(root, bg="black")
text_frame.pack(side="top", pady=(50, 20), anchor="center", expand=True)
text_var = StringVar()
text_widget = tk.Label(text_frame, textvariable=text_var, wraplength=1000, bg="black", fg="white", font=("Nanum Gothic", 12))
text_widget.pack(expand=True, fill=BOTH, anchor="center")
response_queue = queue.Queue()
threading.Thread(target=lambda: asyncio.run(main_with_gui(response_queue, text_var, settings)), daemon=True).start()
threading.Thread(target=update_text_widget, args=(text_widget, text_var, response_queue, root), daemon=True).start()
root.mainloop()
def update_text_widget(text_widget, text_var, response_queue, root):
while True:
response = response_queue.get()
text_var.set(f"ALICS: {response}\n")
if __name__ == "__main__":
create_gui()
ALICS with the cloned voice
import cv2
from tkinter import *
import tkinter as tk
from PIL import Image, ImageTk
import threading
import openai
import asyncio
import whisper
import speech_recognition as sr
import simpleaudio as sa
from pydub import AudioSegment
import queue
import elevenlabs
from elevenlabs import generate, play
from elevenlabs import clone, generate as generate_voice
from elevenlabs import set_api_key
set_api_key("[YOUR API KEY]")  # ElevenLabs API key
openai.api_key = "[YOUR API KEY]"
recognizer = sr.Recognizer()
recognizer.energy_threshold = 300
GPT_WAKE_WORD = "hi alex"
GPT_SLEEP_WORD = "goodbye"
is_audio_playing = False
video_file1 = "/Users/giaogiaoguo/Desktop/interface/Alics_vid/Defult2.MOV"
video_file2 = "/Users/giaogiaoguo/Desktop/interface/Alics_vid/SPeak.MOV"
video_capture1 = cv2.VideoCapture(video_file1)
video_capture2 = cv2.VideoCapture(video_file2)
def create_settings_window(settings):
settings_window = Tk()
settings_window.title("ALICS Settings")
def save_settings():
settings['system_message'] = system_message_entry.get()
settings['max_tokens'] = int(max_tokens_entry.get())
settings_window.destroy()
settings['settings_open'] = False
system_message_label = tk.Label(settings_window, text="System message:")
system_message_label.grid(row=0, column=0, sticky="e", padx=5, pady=5)
system_message_entry = tk.Entry(settings_window)
system_message_entry.insert(0, settings['system_message'])
system_message_entry.grid(row=0, column=1, padx=5, pady=5)
max_tokens_label = tk.Label(settings_window, text="Max tokens:")
max_tokens_label.grid(row=1, column=0, sticky="e", padx=5, pady=5)
max_tokens_entry = tk.Entry(settings_window)
max_tokens_entry.insert(0, settings['max_tokens'])
max_tokens_entry.grid(row=1, column=1, padx=5, pady=5)
save_button = tk.Button(settings_window, text="Save", command=save_settings)
save_button.grid(row=2, column=1, padx=5, pady=5, sticky="e")
settings_window.mainloop()
def show_settings_window(settings):
print("show_settings_window called")
if not settings['settings_open']:
settings['settings_open'] = True
create_settings_window(settings)
def get_wake_word(phrase):
if GPT_WAKE_WORD in phrase.lower():
return GPT_WAKE_WORD
else:
return None
def get_sleep_word(phrase):
if GPT_SLEEP_WORD in phrase.lower():
return GPT_SLEEP_WORD
else:
return None
def clone_voice():
voice = clone(
name="yuris",
description="",
files=[""],
)
return voice
voice = clone_voice()
def synthesize_speech(text, output_filename):
audio = generate_voice(text=text, voice=voice)
with open(output_filename, 'wb') as f:
f.write(audio)
def transcribe_audio_with_whisper(audio_file_path):
model = whisper.load_model("base")
result = model.transcribe(audio_file_path)
return result["text"].strip()
def play_audio(file):
global is_audio_playing
is_audio_playing = True
sound = AudioSegment.from_mp3(file)
audio_data = sound.export(format="wav")
audio_data = audio_data.read()
audio_data = audio_data[44:]
audio_wave = sa.WaveObject(audio_data, sound.channels, sound.sample_width, sound.frame_rate)
play_obj = audio_wave.play()
play_obj.wait_done()
is_audio_playing = False
def process_user_input(user_input, response_queue):
bot_response = "You said: " + user_input
response_queue.put(bot_response)
async def main_with_gui(response_queue, text_var, settings):
wake_word_detected = False
greeting_played = False
while True:
with sr.Microphone() as source:
recognizer.adjust_for_ambient_noise(source)
if not wake_word_detected:
print(f"Say {GPT_WAKE_WORD} to start a conversation...")
while True:
audio = recognizer.listen(source)
audio_file = "audio.wav"
with open(audio_file, "wb") as f:
f.write(audio.get_wav_data())
phrase = transcribe_audio_with_whisper(audio_file)
print(f"Phrase: {phrase}")
if get_wake_word(phrase) is not None:
wake_word_detected = True
break
else:
print("Not a wake word. Try again.")
if not greeting_played:
play_audio('greetings.mp3')
greeting_played = True
while wake_word_detected:
print("Speak a prompt...")
audio = recognizer.listen(source)
audio_file = "audio_prompt.wav"
with open(audio_file, "wb") as f:
f.write(audio.get_wav_data())
user_input = transcribe_audio_with_whisper(audio_file)
print(f"User input: {user_input}")
if get_sleep_word(user_input) is not None:
wake_word_detected = False
greeting_played = False
print("Sleep word detected, going back to listening for wake word.")
play_audio('sleep.mp3')
text_var.set("")
break
response = openai.ChatCompletion.create(
model="gpt-3.5-turbo",
messages=[
{"role": "system", "content": settings['system_message']},
{"role": "user", "content": user_input},
],
temperature=0.6,
max_tokens=settings['max_tokens'],
top_p=1,
frequency_penalty=0,
presence_penalty=0,
n=1,
stop=["\nUser:"],
)
bot_response = response["choices"][0]["message"]["content"]
response_queue.put(bot_response)
synthesize_speech(bot_response, 'response.mp3')
play_audio('response.mp3')
def create_gui():
root = Tk()
root.title("ALICS (Artificial Intelligence for Life Improvement and Counseling Support)")
menu_bar= tk.Menu(root)
root.config(menu=menu_bar)
settings_menu = tk.Menu(menu_bar, tearoff=0)
menu_bar.add_cascade(label="Settings", menu=settings_menu)
settings_menu.add_command(label="Open Settings", command=lambda: show_settings_window(settings))
settings = {
'system_message': "Your name is Yuris, You are a 19 years old college student who speaks with a sarcastic tone. You are talking to a friend who is trying to imporve her life",
'max_tokens': 150,
'settings_open': False
}
screen_width = root.winfo_screenwidth()
screen_height = root.winfo_screenheight()
root.geometry(f"{screen_width}x{screen_height}")
root.configure(bg="black")
system_message_var = StringVar()
system_message_var.set(settings['system_message'])
max_tokens_var = IntVar()
max_tokens_var.set(150)
video_frame = Frame(root, bg="black")
video_frame.pack(side="top", pady=(10, 0), anchor="center", expand=True)
video_file1 = "/Users/giaogiaoguo/Desktop/interface/Alics_vid/Defult2.MOV"
video_file2 = "/Users/giaogiaoguo/Desktop/interface/Alics_vid/SPeak.MOV"
video_capture1 = cv2.VideoCapture(video_file1)
video_capture2 = cv2.VideoCapture(video_file2)
second_video_frame_rate = 60
video_capture2.set(cv2.CAP_PROP_FPS, second_video_frame_rate)
def update_video_label():
global is_audio_playing, video_capture1, video_capture2
if is_audio_playing:
video_capture = video_capture2
scale_percent = 30
frame_rate=100
else:
video_capture = video_capture1
scale_percent = 30
frame_rate=100
ret, frame = video_capture.read()
if not ret:
video_capture.set(cv2.CAP_PROP_POS_FRAMES, 0)
ret, frame = video_capture.read()
cv2image = cv2.cvtColor(frame, cv2.COLOR_BGR2RGBA)
width = int(cv2image.shape[1] * scale_percent / 100)
height = int(cv2image.shape[0] * scale_percent / 100)
dim = (width, height)
resized = cv2.resize(cv2image, dim, interpolation=cv2.INTER_AREA)
img = Image.fromarray(resized)
imgtk = ImageTk.PhotoImage(image=img)
label.config(image=imgtk)
label.imgtk = imgtk
delay= int(1000 / frame_rate)
root.after(delay, update_video_label)
label = Label(video_frame, bg="black")
label.pack(side="top", anchor="center")
update_video_label()
text_frame = Frame(root, bg="black")
text_frame.pack(side="top", pady=(50, 20), anchor="center", expand=True)
text_var = StringVar()
text_widget = tk.Label(text_frame, textvariable=text_var, wraplength=1000, bg="black", fg="white", font=("Nanum Gothic", 12))
text_widget.pack(expand=True, fill=BOTH, anchor="center")
response_queue = queue.Queue()
threading.Thread(target=lambda: asyncio.run(main_with_gui(response_queue, text_var, settings)), daemon=True).start()
threading.Thread(target=update_text_widget, args=(text_widget, text_var, response_queue, root), daemon=True).start()
root.mainloop()
def update_text_widget(text_widget, text_var, response_queue, root):
while True:
response = response_queue.get()
text_var.set(f"YOUR_NAME: {response}\n")
if __name__ == "__main__":
create_gui()
Other parts - presentation, mock-up, woodwork
Presentation

User (or just me) test video
Mock-up

