voice synthesis – Personal Blog

I’m a huge fan of synthetic voices. I love devices around me chattering away letting me know things so I don’t have to look at another screen and interpret what’s being displayed.

But the thing is, these voices don’t have to be great. In fact I prefer if they’re a little clunky and jagged as they realise the dream of the future I had as a kid growing up in the 80s. I thought that when the year 2000 rolled around, we’d have talking robots wandering our houses, our kitchen appliances would announce when they’re being turned on and off, and our houses would announce “night mode” as dusk rolled around. Unfortunately the world has turned out to be far more conservative in weirdness than I had hoped, so I realised I had to make this happen for myself.

I already have home automation happening in my home, built with Home Assistant at its core, and with a focus on locally-processed everything rather than relying on cloud based services like Google and Amazon offer. This has allowed me the freedom to get as weird as I want, and to make the look and feel exactly what I want.

Part of this process has been to reflash every smart bulb and smart switch I use with the amazing open source Tasmota. It allows for truly locally processed and linked devices, that don’t need an external service, just your local Home Assistant controller. In my case I have Home Assistant running on a tiny Raspberry Pi 4 upstairs on the wall.

Tasmota is breathtaking in complexity and ability. It can adapt to almost every smart device and is constantly being expanded, and yet still fits on the super tiny and super cheap ESP8266 and ESP32 chips that are found in almost every smart iot device on the market (and of course you can buy them standalone for your own builds).

Recently I was forced to compile Tasmota from sources to enable some built in functions that aren’t enabled in the default binary builds (for a kitchen control interface requiring a multiplexor chip). While I was doing this I stumbled across some very promising libraries that were in the source code for audio and “SAM” text to speech. My heart skipped a beat.

For those not in the know, “SAM” (or Software Automated Mouth) was a program for the ancient Commodore 64 computer, that allowed for some of the earliest domestic speech synthesis. It’s very recognisable as it was in so many things from movies, to music, to TV, as well as being every 80s kid’s dream. A computer that can talk!

Turns out, this software was ported to a C library by Sebastian Macke and put up on GitHub some time ago, and then adapted to run on microcontrollers by Earle F. Philhower, III. (especially the ESP8266). This meant you could already make this happen if you wrote your own code from scratch and use the library on ESP8266, but somewhere along the way it was added to Tasmota. I couldn’t find documentation for it, but there it was, hiding away, along with commands to make your Tasmota speak.

I quickly realised, though, that in order to perform this trick, I’d need to also buy an I2S IC/board and amplifier as the audio output library relied on I2S which is a simple audio interfacing specification. Being that I wanted to use this voice inside my doorbell button, I didn’t want to spend the money, or make the doorbell button that large to fit all of this.

That’s where I did some digging and found that the ESP8266audio library had a mode where it could roughly bit-bang audio out of the RX pin of the ESP board. From this output, you could make a very simple amplifier with 2 basic transistors to drive a speaker at an audible volume.

Unfortunately, Tasmota source code didn’t have this ability yet, so I set about forking the source code, modifying it, and merging it back (pull request) to Tasmota’s team to add this ability.

The nimble team have already merged this into the Tasmota Development branch so it’s ready to use, but you’ll need to compile it yourself. I won’t go into setting up an IDE for Tasmota compilation from source as that’s been covered quite well by other people including in the readme for Tasmota itself (I recommend the Atom + PlatformIO method):

https://github.com/arendst/Tasmota

Make sure you clone the Development branch (as at 10th Feb 2021) – it’ll move into the main releases at some point.

In order to enable audio output for Tasmota without I2S hardare, you’ll need to add to your “tasmota/user_config_override.h” file the following:

#ifndef USE_I2S_AUDIO
#define USE_I2S_AUDIO
#endif

#ifdef USE_I2S_EXTERNAL_DAC
#undef USE_I2S_EXTERNAL_DAC
#endif

#ifndef USE_I2S_NO_DAC
#define USE_I2S_NO_DAC
#endif

This allows you to enable audio, override the default (to use an external I2S DAC board), and enable the use of direct output.

But before we do anything more, we definitely need to hook up at least one transistor to the output from the ESP chip as you definitely cannot drive a speaker directly (it’ll also probably burn out the chip, or the pin on the chip trying to do so). For the following I assume that you’re running your ESP board from 5V to its 5v/USB input so that it regulates its required 3.3v onboard. We’ll use some of this 5V to feed the transistor and in turn the speaker.

You’ll need:
1 x 2N3904 transistor (NPN type, driven by positive voltage, but switching the negative)
1 x 1k resistor
1 x 3w or so speaker (nothing under 4 ohms)

When driving the audio output with this method, it will always come out of the RX pin of the ESP board. So when I say audio output, I mean the RX pin.

Connect the resistor between the RX pin and the base of the transistor (middle leg).
Connect the collector of your transistor (right pin of transistor with flat face side facing you) to the negative side of your speaker
Connect the positive side of your speaker to 5 volts
Connect the emmitter of your transistor (left pin of transistor with flat face side facing you) to the Ground or negative from the 5V supply, or the ground of your ESP board.

This is a very basic single transistor amplifier. This is what’s outlined on the ESP8266audio library page here:

https://github.com/earlephilhower/ESP8266Audio

Yes the output can be a little rough, and yes if you went to use some of the other capabilities like playback of files, playing of web radio stations (Which is actually pretty cool), they would sound pretty rough which a whistle over the top, but the SAM voice sounds perfectly the same as it originally did.

So we’ve uploaded our custom-compiled Tasmota binary to the board, how can we make it speak? Well documentation is thin (I’ll contribute some to the Tasmota project to help out of course), but you only need to issue the following at the console of Tasmota:

I2SSay(text goes here)

If you’ve played with old speech synthesizers before, you’ll know that they don’t always pronounce words correctly, so you’ll need to craft words at times to sound the way they’re supposed to. For example the word “house” can sound a little strange, so I use the word “howse”. Sometimes adding an h after vowels in words can help too. It’s all up to experimentation.

So it can speak when we issue commands at the console of Tasmota now, but that’s not super useful yet. We want automation!

I use Home Assistant, combined with the MQTT integration for my Tasmota linked automation, so it’s quite easy to issue anything that can be done at the console in Tasmota as an MQTT message.

In whatever script or automation you’re building in Home Assistant, all you need to do is add action type “Call Service” with the service being “mqtt.publish”, and the service data as:

payload: (hello I am home assistant. I am pleased to meet you!)
topic: cmnd/speakboy/I2SSay

You’ll see in the topic above that my Tasmota device has “device name” in config -> other config set to “speakboy”. The payload is simply what you want to say, surrounded by brackets. You can of course put substitution into play to drop in current weather conditions, or variables or whatever you want using Home Assistant methods, as long as it comes out as something that SAM can say.

You may find in your case, like mine, that audio output wasn’t high enough in volume for your purposes. I’m using mine as a doorbell announcer (at the button end, to speak to visitors while they wait for me to run down the stairs for the door) so there is road noise to compete with.

The first step is to try the gain control. It is set at 10 by default, but I found a balance between loudness and distortion to be at 20. Simply issue the command in the console in Tasmota:

I2SGain 20

If we also want to improve the speaker, mount it in a hole in a hollow box or cavity, or even a short length of pvc pipe glued to the back. The back pressure will give the speaker more ooph, as well as allowing some more resonance.

If it still isn’t loud enough we can go further with another transistor. It’s quite easy to use a suitable PNP transistor in combination with the already explained NPN transistor to amplify that current even higher for the speaker.

I’m using a BC559 PNP transistor for the purpose. By modifying how we above ran our simple amplifier, we can get more current to the speaker:

Disconnect the speaker, connect the collector on the 2N3904 to 5V
Disconnect the emitter of the 2N3904 from ground and connect it instead to the base of the BC559 (middle pin)
Connect the Collector of the BC559 (left pin when facing the flat front) to ground/negative.
Connect the Emitter of the BC559 (right pin when facing the flat front) to the negative of the speaker.
Connect the positive of the speaker to 5V

You should now be much MUCH louder, but just make sure you’re not overdoing it by feeling the transistors. They shouldn’t be getting hot.

A quick note here: Never connect this to an actual amplifier. It’s switched DC voltage, not variable AC which is what audio is. It’s also WAY too high for line level audio, at around 5 times the gain. Bad things will happen to the amplifier, and if they don’t, it’ll also sound terrible!

So there you have it. Tasmota speaking everywhere all the time! Get in touch if you have problems or comments – always happy to help!

Tag: voice synthesis

Tasmota on ESP8266 can speak