Grand goal
The creation of a voice interface that integrates with Everything. See: Her, I, Robot, the DCOM Smart House, etc. The aim is the same ubiquity of function that the keyboard and mouse have.
The key principles in this project are
- A tool should reduce friction or make process more consistent, or ideally, both.
- A tool should be as easy to adopt as possible; fulfilling the point above naturally facilitates this one.
- This includes integrating with existing systems
This project is aimed towards creation of a tool like that, via Voice as an Interface.
Problems that exist
Either prior to this project, or in naïve solutions to some of the goals
- Inaccuracies in LLMs
- Inaccuracies in speech recognition
- Limitations in the core functionality of existing personal assistants
- Knowledge Management in the Information Overload Age
- Limitations on privacy orientation in smart home and general voice assistant tech
Major project-level concerns
- Achieving this is so close and yet so far away from AGI
- This could be Promethean in its full realization. I doubt most users are prepared for what we have even now.
System Components
Major system components
- Smart Home/IoT integration
- General computer/device integration
- PKM easing
- Error correction via multi-LLM interpretation/checking.
Smart home/IoT integration
Goal: multi-context home automation via a voice interface
- There are already some real scenarios where this has partial application (like Home Assistant plugins, or VoiceOS)
- LLMs can act as semantic glue. Using RAG and large-context models, an appreciation of context at a so-far-unrealized level becomes possible (using microphone detection strength to contextualize the user’s input to a room, a user profile, a knowledgebase of the current state of the world, etc.).
- If a user says “It’s a little dark in here”, many microphones in the home may pick that up, but it should be easy to compare reception strength to find the one closest to the user, determine the room context, find lighting devices that are off or dimmed, and rectify the issue. Historical data may even indicate what brightness the user tends to prefer in that room, which, using lux detection, can be targeted (see the first sketch after this list).
- Naturalistic voice generation
- It is my personal opinion that diarization, proper inflection, pauses, and emotive variance, in that order, are more important points to focus on than the robotic/crusty timbre of a voice. The human on the other side of a bad microphone still sounds like a human because of the conveyance achieved by everything else.
- Probably the best execution I have seen so far belongs to this short snippet of a longer video. The company is Speechmatics.
- StyleTTS2 has inflection, pauses, and emotive fluctuation. If properly tuned, it can also generate voices with good clarity. I am unsure whether it is fast enough for real-time use on personal compute devices. As far as the HuggingFace community is concerned, it also ranks as probably the best open-source model.
- For now, I use Piper (largely abandonware, but well rounded for simple synthesis, and it can be loaded as a server endpoint).
- Interruptible, mid-speech
- Conversational, not as a mode, but as a WYSIWYG (sed s/see/say/)
- Speaker identification (diarization + stored comparison data) via previously submitted self-ID voice samples, which later becomes an associated context and an authentication/authority layer for the system (see the second sketch after this list).
- What if the system could write its own scripts and test them with you on the spot? What if it could detect a new IoT device and ask you where to allocate it? What if you didn’t have to get all fiddly with crap, thanks to layers of autodetection, LLM glue, and active demoing/error correction?
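A minimal sketch of the “it’s a little dark in here” flow, assuming hypothetical helpers for mic signal strength, a room-to-device map, and stored brightness preferences. None of these names come from a real library; in practice the last step would be a Home Assistant service call or similar:

```python
# Hypothetical sketch: route an utterance to a room and fix the lighting there.
# Mic, lights, and history are stand-ins, not a real smart-home API.
from dataclasses import dataclass

@dataclass
class Mic:
    room: str
    signal_strength: float  # how strongly this mic picked up the utterance

def handle_utterance(text: str, mics: list[Mic], lights: dict, history: dict) -> None:
    # 1. Localize: the mic with the strongest reception is closest to the speaker.
    room = max(mics, key=lambda m: m.signal_strength).room

    # 2. Contextualize: find lights in that room that are off or dimmed.
    dim_lights = [l for l in lights.get(room, []) if l["brightness"] < 50]

    # 3. Personalize: target the brightness the user historically prefers here.
    target = history.get(room, {}).get("preferred_brightness", 80)

    for light in dim_lights:
        light["brightness"] = target  # in reality: a service call to the device

# Example: two mics heard the phrase; the office one heard it loudest.
mics = [Mic("office", 0.9), Mic("hallway", 0.3)]
lights = {"office": [{"name": "desk lamp", "brightness": 10}]}
handle_utterance("It's a little dark in here", mics, lights,
                 {"office": {"preferred_brightness": 75}})
print(lights["office"])  # desk lamp now at the preferred level
```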
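And a sketch of the speaker-identification comparison: cosine similarity between an incoming voice embedding and the stored self-ID samples. The embeddings themselves would come from a speaker model (pyannote, resemblyzer, or similar); the numbers below are toy stand-ins:

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def identify_speaker(query_embedding: np.ndarray,
                     enrolled: dict[str, np.ndarray],
                     threshold: float = 0.75) -> str | None:
    """Match a voice embedding against stored self-ID samples; None = unknown speaker."""
    best_user, best_score = None, threshold
    for user, ref in enrolled.items():
        score = cosine(query_embedding, ref)
        if score > best_score:
            best_user, best_score = user, score
    return best_user  # later: attach this user's context and authority level

# Toy 3-dim "embeddings"; real ones come from a speaker-embedding model.
enrolled = {"alice": np.array([0.9, 0.1, 0.0]), "bob": np.array([0.1, 0.9, 0.1])}
print(identify_speaker(np.array([0.85, 0.15, 0.05]), enrolled))  # -> "alice"
```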
General computer integration
Goal: voice as an interface to the compute substrate, as clean and integrated as a keyboard. Or: imagine a world where screens are optional (getting some Dynamicland or Folk vibes here).
- LLMs have some nascent architectures that people are currently experimenting with
- LLMs like Claude playing Minecraft using a function toolbox (src)
- Multi-layer/multi-stage processes, like documentation generation for large repositories (e.g., Driver AI).
- Generally, agentic systems given toolboxes and context to work within sandboxes (see the sketch after this list).
- I can’t source it at the moment, but I recently saw a video of someone screen-sharing to Sonnet(?) and having it write Python for Blender and see the results. They also used the voice-to-voice system, glued together via a mouse-scripting program that would copy/paste Sonnet’s Python output into the Blender script page and execute it. It did not last beyond tutorial-level complexity, but it is another interesting avenue/example of automating execution against general compute I/O.
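A vendor-agnostic sketch of that agentic toolbox loop. The `llm` callable and the JSON tool-call format are assumptions, not any particular provider’s API:

```python
import json
from typing import Callable

def run_agent(llm: Callable[[list[dict]], str], toolbox: dict[str, Callable],
              user_request: str, max_steps: int = 5) -> str:
    """Loop: ask the model, execute any tool it requests, feed the result back."""
    messages = [{"role": "user", "content": user_request}]
    for _ in range(max_steps):
        reply = llm(messages)
        try:
            # Assumed tool-call shape: {"tool": "set_volume", "args": {"level": 30}}
            call = json.loads(reply)
            result = toolbox[call["tool"]](**call["args"])
            messages.append({"role": "tool", "content": str(result)})
        except (json.JSONDecodeError, TypeError, KeyError):
            return reply  # plain text -> treat as the final answer for the user
    return "Stopped: too many tool calls."

# Example toolbox entry the voice layer might expose; real ones wrap device/OS actions.
toolbox = {"set_volume": lambda level: f"volume set to {level}%"}
```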
PKM easing
Goal: rubber-duck with an assistant that knows all your notes via RAG; write better, write moar.
- Broadly, this is almost its own project, but there are a few points where model integration could automate a lot of the work.
- Zettelkasten and various other models for note arrangement exist; Obsidian, Notion, Anytype, Zettlr, and so on exist.
- We live in the future; meticulous tagging systems could be made trite compared to a properly realized LLM hooked into a RAG system.
- I use a Supernote Nomad, linked to my OneDrive, to sync books and notes both ways. The Supernote product supports generation of citations plus handwritten notes contextualized to those citations, known as Digests. It also supports an easy way to export those to PDF. These are also synced.
- Current issue is that I don’t have an OCR step for the handwritten part of the notes.
- I also want to set up an email-pulling client so that certain emails arrive as PDFs and can go through the Nomad’s system. This has been a small headache.
- Everything should become a digestible PDF so I can shove it into a vector DB
- An embedding-model prototype could run on a simple cron job to start, maybe later transitioning to a hook monitoring, say, a OneDrive folder (see the sketch after this list). The point is that, tuned correctly, this could turn a personal LLM into a second brain, able to keep up with all of your thoughts and provide rubber-duck-style conversation. Perfect for polishing rough drafts of writing and furthering thoughtlines.
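A rough sketch of that cron-job ingestion step, assuming the sentence-transformers and pypdf packages; the OneDrive path, chunk size, and flat in-memory index are all placeholders for whatever ends up being used:

```python
# Embed synced PDFs into a simple vector index; a real run would only touch new files.
from pathlib import Path
import numpy as np
from sentence_transformers import SentenceTransformer
from pypdf import PdfReader

SYNC_DIR = Path("~/OneDrive/Digests").expanduser()   # placeholder path
model = SentenceTransformer("all-MiniLM-L6-v2")      # small, CPU-friendly embedder

def extract_text(pdf_path: Path) -> str:
    return "\n".join(page.extract_text() or "" for page in PdfReader(pdf_path).pages)

def chunk(text: str, size: int = 800) -> list[str]:
    return [text[i:i + size] for i in range(0, len(text), size)]

corpus, vectors = [], []
for pdf in SYNC_DIR.glob("*.pdf"):
    for piece in chunk(extract_text(pdf)):
        corpus.append((pdf.name, piece))
        vectors.append(model.encode(piece, normalize_embeddings=True))
index = np.vstack(vectors) if vectors else np.empty((0, 384))

def query(question: str, k: int = 3) -> list[tuple[str, str]]:
    """Return the k most relevant (filename, chunk) pairs for RAG context."""
    q = model.encode(question, normalize_embeddings=True)
    top = np.argsort(index @ q)[::-1][:k]   # cosine similarity via normalized dot product
    return [corpus[i] for i in top]
```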
Error correction via multi-LLM interpretation/checking.
Goal: generally, figure out sustainable ways to greatly reduce hallucinations.
- Literally, wisdom of crowds.
- A thin verification layer between Whisper interpretations to filter out hallucinatory content (see the sketch after this list).
- Likewise for generated responses from the core LLM.
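A sketch of that thin verification layer: transcribe the same clip with two Whisper model sizes and flag low agreement as possibly hallucinatory. Assumes the openai-whisper package; the agreement threshold is an arbitrary starting point, not a tuned value:

```python
import difflib
import whisper

small = whisper.load_model("base")
large = whisper.load_model("small")

def transcribe_with_check(audio_path: str, min_agreement: float = 0.85) -> dict:
    """Cross-check two transcriptions; disagreement is a cheap proxy for garbled/hallucinated output."""
    a = small.transcribe(audio_path)["text"].strip()
    b = large.transcribe(audio_path)["text"].strip()
    agreement = difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio()
    if agreement < min_agreement:
        # Low agreement: re-prompt the user, re-record, or let the core LLM
        # arbitrate using conversational context.
        return {"text": b, "suspect": True, "agreement": agreement}
    return {"text": b, "suspect": False, "agreement": agreement}
```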
Conclusion
I find that all of the pieces for this exist. They must be painstakingly woven together. The final result would be incredible to have, from both a “the future is now” and a quality-of-life perspective. The direction of this project is not to use AI tech as a replacement for human cognition, but as a truly helpful assistant that transcends gimmick and becomes personable and useful.