Grand goal
The creation of a voice interface that integrates with Everything. See: Her, I, Robot, the DCOM Smart House, etc. The aim is the same ubiquity of function that the keyboard and mouse have.
The key principles in this project are
- A tool should reduce friction or make process more consistent, or ideally, both.
- A tool should be as easy to adopt as possible; fulfilling the point above naturally facilitates this one.
- This includes integrating with existing systems
This project is aimed towards creation of a tool like that, via Voice as an Interface.
Problems that exist
Either prior to this project, or in naïve solutions to some of the goals
- Inaccuracies in LLMs
- Inaccuracies in speech recognition
- Limitations in the core functionality of existing personal assistants
- Knowledge Management in the Information Overload Age
- Limitations on privacy orientation in smart home and general voice assistant tech
Major project-level concerns
- Achieving this is so close and yet so far away from AGI
- This could be Promethean in its full realization. I doubt most users are prepared for what we have even now.
System Components
Major system components
- Smart Home/IoT integration
- General computer/device integration
- PKM easing
- Error correction via multi-LLM interpretation/checking.
Smart home/IoT integration
Goal: multi-context home automation via a voice interface
- There are already some real scenarios where this has partial application (like Home Assistant plugins, or VoiceOS)
- LLMs can act as semantic glue. Using RAG and large-context models, an appreciation of context at a so-far-unrealized level becomes possible (using microphone detection strength to contextualize the user’s input to a room, a user profile, a knowledgebase of the current state of the world, etc.).
- If a user says “It’s a little dark in here”, many microphones in the home may pick that up, but it should be easy to compare reception strength to find the one closest to the user, determine the room context, find lighting devices that are off or dimmed, and rectify the issue. Historical data may even indicate what brightness the user tends to prefer in that room, which, using lux detection, can be targeted (see the first sketch after this list).
- Naturalistic voice generation
- It is my personal opinion that diarization, proper inflection, pauses, and emotive variance, in that order, are more important points to focus on than the robotic/crusty timbre of a voice. The human on the other side of a bad microphone still sounds like a human because of the conveyance achieved by everything else.
- Probably the best execution I have seen so far belongs to this short snippet of a longer video. The company is Speechmatics.
- StyleTTS2 has inflection, pauses, and emotive fluctuation. If properly tuned, it can also generate voices with good clarity. I am unsure whether it is fast enough for real-time use on personal compute devices. As far as the HuggingFace community is concerned, it also ranks as probably the best open-source model.
- For now, I use Piper (largely abandonware, but well rounded for simple synthesis, and it can be loaded as a server endpoint).
- Interruptible, mid-speech
- Conversational, not as a mode, but as a WYSIWYG (sed s/see/say/)
- Speaker identification (diarization + stored comparison data) via previously submitted self-ID voice samples, which later becomes an associated context and an authentication/authority layer for the system (see the second sketch after this list).
- What if the system could write its own scripts and test them with you on the spot? What if it could detect a new IoT device and ask you where to allocate it? What if you didn’t have to get all fiddly with crap, thanks to layers of autodetection, LLM glue, and active demoing/error correction?
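A minimal sketch of the “it’s a little dark in here” flow, assuming hypothetical helpers for mic signal strength, a room-to-device map, and stored brightness preferences. None of these names come from a real library; in practice the last step would be a Home Assistant service call or similar:

```python
# Hypothetical sketch: route an utterance to a room and fix the lighting there.
# Mic, lights, and history are stand-ins, not a real smart-home API.
from dataclasses import dataclass

@dataclass
class Mic:
    room: str
    signal_strength: float  # how strongly this mic picked up the utterance

def handle_utterance(text: str, mics: list[Mic], lights: dict, history: dict) -> None:
    # 1. Localize: the mic with the strongest reception is closest to the speaker.
    room = max(mics, key=lambda m: m.signal_strength).room

    # 2. Contextualize: find lights in that room that are off or dimmed.
    dim_lights = [l for l in lights.get(room, []) if l["brightness"] < 50]

    # 3. Personalize: target the brightness the user historically prefers here.
    target = history.get(room, {}).get("preferred_brightness", 80)

    for light in dim_lights:
        light["brightness"] = target  # in reality: a service call to the device

# Example: two mics heard the phrase; the office one heard it loudest.
mics = [Mic("office", 0.9), Mic("hallway", 0.3)]
lights = {"office": [{"name": "desk lamp", "brightness": 10}]}
handle_utterance("It's a little dark in here", mics, lights,
                 {"office": {"preferred_brightness": 75}})
print(lights["office"])  # desk lamp now at the preferred level
```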
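And a sketch of the speaker-identification comparison: cosine similarity between an incoming voice embedding and the stored self-ID samples. The embeddings themselves would come from a speaker model (pyannote, resemblyzer, or similar); the numbers below are toy stand-ins:

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def identify_speaker(query_embedding: np.ndarray,
                     enrolled: dict[str, np.ndarray],
                     threshold: float = 0.75) -> str | None:
    """Match a voice embedding against stored self-ID samples; None = unknown speaker."""
    best_user, best_score = None, threshold
    for user, ref in enrolled.items():
        score = cosine(query_embedding, ref)
        if score > best_score:
            best_user, best_score = user, score
    return best_user  # later: attach this user's context and authority level

# Toy 3-dim "embeddings"; real ones come from a speaker-embedding model.
enrolled = {"alice": np.array([0.9, 0.1, 0.0]), "bob": np.array([0.1, 0.9, 0.1])}
print(identify_speaker(np.array([0.85, 0.15, 0.05]), enrolled))  # -> "alice"
```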
General computer integration
Goal: voice as an interface to the compute substrate, as clean and integrated as a keyboard. Or: imagine a world where screens are optional (getting some Dynamicland or Folk vibes here).
- LLMs have some nascent architectures that people are currently experimenting with
- LLMs like Claude playing Minecraft using a function toolbox (src)
- Multi-layer/multi-stage processes, like documentation generation for large repositories (e.g., Driver AI).
- Generally, agentic systems given toolboxes and context to work within sandboxes (see the sketch after this list).
- I can’t source it at the moment, but I recently saw a video of someone screen-sharing to Sonnet(?) and having it write Python for Blender and see the results. They also used the voice-to-voice system, glued together via a mouse-scripting program that would copy/paste Sonnet’s Python output into the Blender script page and execute it. It did not last beyond tutorial-level complexity, but it is another interesting avenue/example of automating execution against general compute I/O.
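A vendor-agnostic sketch of that agentic toolbox loop. The `llm` callable and the JSON tool-call format are assumptions, not any particular provider’s API:

```python
import json
from typing import Callable

def run_agent(llm: Callable[[list[dict]], str], toolbox: dict[str, Callable],
              user_request: str, max_steps: int = 5) -> str:
    """Loop: ask the model, execute any tool it requests, feed the result back."""
    messages = [{"role": "user", "content": user_request}]
    for _ in range(max_steps):
        reply = llm(messages)
        try:
            # Assumed tool-call shape: {"tool": "set_volume", "args": {"level": 30}}
            call = json.loads(reply)
            result = toolbox[call["tool"]](**call["args"])
            messages.append({"role": "tool", "content": str(result)})
        except (json.JSONDecodeError, TypeError, KeyError):
            return reply  # plain text -> treat as the final answer for the user
    return "Stopped: too many tool calls."

# Example toolbox entry the voice layer might expose; real ones wrap device/OS actions.
toolbox = {"set_volume": lambda level: f"volume set to {level}%"}
```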
PKM easing
Goal: rubber-duck with an assistant that knows all your notes via RAG; write better, write moar.
- Broadly, this is almost its own project, but there are a few points where model integration could automate a lot of the work.
- Zettelkasten and various other models for note arrangement exist; Obsidian, Notion, Anytype, Zettlr, and so on exist.
- We live in the future; meticulous tagging systems could be made trite compared to a properly realized LLM hooked into a RAG system.
- I use a Supernote Nomad, linked to my OneDrive, to sync books and notes both ways. The Supernote product supports generation of citations plus handwritten notes contextualized to those citations, known as Digests. It also supports an easy way to export those to PDF. These are also synced.
- Current issue is that I don’t have an OCR step for the handwritten part of the notes.
- I also want to set up an email-pulling client so that certain emails arrive as PDFs and can go through the Nomad’s system. This has been a small headache.
- Everything should become a digestible PDF so I can shove it into a vector DB
- An embedding-model prototype could run on a simple cron job to start, maybe later transitioning to a hook monitoring, say, a OneDrive folder (see the sketch after this list). The point is that, tuned correctly, this could turn a personal LLM into a second brain, able to keep up with all of your thoughts and provide rubber-duck-style conversation. Perfect for polishing rough drafts of writing and furthering thoughtlines.
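A rough sketch of that cron-job ingestion step, assuming the sentence-transformers and pypdf packages; the OneDrive path, chunk size, and flat in-memory index are all placeholders for whatever ends up being used:

```python
# Embed synced PDFs into a simple vector index; a real run would only touch new files.
from pathlib import Path
import numpy as np
from sentence_transformers import SentenceTransformer
from pypdf import PdfReader

SYNC_DIR = Path("~/OneDrive/Digests").expanduser()   # placeholder path
model = SentenceTransformer("all-MiniLM-L6-v2")      # small, CPU-friendly embedder

def extract_text(pdf_path: Path) -> str:
    return "\n".join(page.extract_text() or "" for page in PdfReader(pdf_path).pages)

def chunk(text: str, size: int = 800) -> list[str]:
    return [text[i:i + size] for i in range(0, len(text), size)]

corpus, vectors = [], []
for pdf in SYNC_DIR.glob("*.pdf"):
    for piece in chunk(extract_text(pdf)):
        corpus.append((pdf.name, piece))
        vectors.append(model.encode(piece, normalize_embeddings=True))
index = np.vstack(vectors) if vectors else np.empty((0, 384))

def query(question: str, k: int = 3) -> list[tuple[str, str]]:
    """Return the k most relevant (filename, chunk) pairs for RAG context."""
    q = model.encode(question, normalize_embeddings=True)
    top = np.argsort(index @ q)[::-1][:k]   # cosine similarity via normalized dot product
    return [corpus[i] for i in top]
```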
Error correction via multi-LLM interpretation/checking.
Goal: generally, figure out sustainable ways to greatly reduce hallucinations.
- Literally, wisdom of crowds.
- A thin verification layer between Whisper interpretations to filter out hallucinatory content (see the sketch after this list).
- Likewise for generated responses from the core LLM.
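A sketch of that thin verification layer: transcribe the same clip with two Whisper model sizes and flag low agreement as possibly hallucinatory. Assumes the openai-whisper package; the agreement threshold is an arbitrary starting point, not a tuned value:

```python
import difflib
import whisper

small = whisper.load_model("base")
large = whisper.load_model("small")

def transcribe_with_check(audio_path: str, min_agreement: float = 0.85) -> dict:
    """Cross-check two transcriptions; disagreement is a cheap proxy for garbled/hallucinated output."""
    a = small.transcribe(audio_path)["text"].strip()
    b = large.transcribe(audio_path)["text"].strip()
    agreement = difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio()
    if agreement < min_agreement:
        # Low agreement: re-prompt the user, re-record, or let the core LLM
        # arbitrate using conversational context.
        return {"text": b, "suspect": True, "agreement": agreement}
    return {"text": b, "suspect": False, "agreement": agreement}
```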
Conclusion
I find that all of the pieces for this exist. They must be painstakingly woven together. The final result would be incredible to have, from both a “the future is now” and a quality-of-life perspective. The direction of this project is not to use AI tech as a replacement for human cognition, but as a truly helpful assistant that transcends gimmick and becomes personable and useful.