On Dec 11, 2014, at 11:53 PM, Aran Lunzer wrote:
> Hi Robert
> Thanks for jumping into these experiments!
It’s been a year: maybe it’s time for Conferencing 5.0?
In any case, I came back to this thread recently. I’ve been doing some work with audio/transcriptions/alignment, and have wanted to build tools for thinking through voice/audio (aided by text).
I’m not exactly happy with it, but it’s interesting enough that I thought I’d share it (i.e. so I can stop working on it for a while).
The things I’m disappointed with:
• The transcription quality (provided by Kaldi) is not quite up to commercial (Google, Apple, Microsoft, Baidu) standards. That said, it’s not hugely behind, and I have many ideas for improving the quality. More damning, however, is how authoritative the text appears, even when it’s wrong. I would like to show a confidence measure, or alternatives. I’m craving lattices. (There’s a rough sketch of what confidence-shaded rendering could look like just after this list.)
• The transcription speed can lag behind realtime. On my MacBook, locally, it’s just about realtime, but my server doesn’t quite have the CPU to keep up. And since the transcription uses a lot of RAM, I only allocate two transcription resources, so if more than two people are speaking at once, not everyone will be transcribed live. This one, at least, is a problem that can easily be solved by money.
• The boxy design doesn’t really work. What I was going for initially was something that looked more like a shared document, but one assembled entirely (and live) by voice. So, for instance, when speaking you would have to indicate where in the document your words should go. I’m still thinking through a UI that can effectively disambiguate between correcting a transcription and editing the document.
• It’s fairly low bandwidth (it should only need 16 kB/s upstream and 16*N kB/s down, where N is the number of incoming audio streams), but it does not handle poor connections gracefully. Then again, at least you can get back to the audio, so in some ways it’s better than anything else out there…
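For concreteness, here’s a rough sketch (not the actual Conferencing code) of what I mean by showing a confidence measure: if the recognizer can attach a word-level confidence (say, a posterior read off a lattice), the rendering side could simply fade uncertain words instead of letting them assert themselves. The Word shape, data-start attribute, and function name below are made up for illustration.

interface Word {
  text: string;   // the recognized word
  start: number;  // seconds into the recording where the word begins
  conf: number;   // recognizer confidence in [0, 1], e.g. a lattice posterior
}

// Render each word as a span whose opacity tracks confidence, and stash the
// start time on the element so that clicking it can seek the audio (sketched
// further down).
function renderTranscript(words: Word[], container: HTMLElement): void {
  for (const w of words) {
    const span = document.createElement("span");
    span.textContent = w.text + " ";
    span.dataset.start = String(w.start);
    span.style.opacity = String(0.4 + 0.6 * w.conf);  // low confidence -> faint
    container.appendChild(span);
  }
}

Opacity is only one option; a subtle underline, or showing runner-up hypotheses on hover, would get closer to what I actually want out of lattices.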
The things that are sort of interesting:
• The paradigm of a conversation blending into a document as it’s happening strikes me as very fertile territory for exploration.
• The click-a-word-to-listen interaction is very satisfying. Combined with the text area being editable (when the background is white), it’s a compelling beginning to an extremely simple transcription UI. It’s already much better than VLC + an empty Microsoft Word document, which as far as I know is what most non-professionals use… (There’s a sketch of the mechanism just after this list.)
• What would it be like to use this as a “kiosk” in a shared space to record audio notes, decisions, messages, etc.?
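Since I mentioned the mechanism: the whole click-a-word-to-listen interaction is essentially one event listener. Another illustrative sketch (again, not the real code), reusing the hypothetical data-start attribute from the earlier snippet:

// Clicking any word span seeks a shared <audio> element to the moment that
// word was spoken and starts playback from there.
function enableClickToListen(container: HTMLElement, audio: HTMLAudioElement): void {
  container.addEventListener("click", (ev) => {
    const start = (ev.target as HTMLElement).dataset.start;
    if (start !== undefined) {
      audio.currentTime = parseFloat(start);
      void audio.play();
    }
  });
}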
Would love to “hang out” with people here to test it. It’s been very buggy[0], but it’s hard for me to break it alone :)
I’ll try to keep the cdg channel more-or-less open for a couple days.
[0] It won’t work in Safari (they refuse to give access to the microphone), nor will it work on mobile (though Android support is hypothetically possible). Please send me bug reports off-list.
> I'm also really looking forward to exploring how your video summarisation and browsing interfaces could be put to good use in Conf 4.0.
Incorporating video/screens/grids would be a possible next step. More on this soon (by some definition of soon).
> Wondering: is there a snappier name we could use?
“Snapping” is actually a meaningful concept w/r/t the alignment of text and audio.
Your correspondent,
R.M.O.