Singing Vocal Synthesizers: Sinsy (English) Tutorial!

Hey! No, I didn't abandon my tutorials. Just been busy is all, but I'm restless after a long day.

Anyway, today I'm gonna discuss Sinsy.

I feel many people struggle to comprehend the tutorials and find working resources. I do it a bit differently than the official tutorials because most of the recommended resources don't work for me very well. Now, if you're an active part of the Singing Synthesizer Community, you probably have the tools needed already! Here's what you'll need, pre-tutorial:

- UTAU (We are going to export as a MIDI)
- (optional) the UTAU Plugin "ImportVSQX" installed; knowledge of how to use the "Import" function to import .vsq or MIDI files.
- CeVIO Creative Studio.
(Note: with CeVIO, I don't know if you need a vocal to use the Song Editor because I have vocals. If you do need a vocal, HAL-O-ROID is free to download.)
- The website sinsy.jp and the official phoneme reference PDF.
- Patience, and an open mind.

Heading to Sinsy, you're immediately greeted with a simple, text-heavy site. You may be intimidated, but try to relax. It's not difficult to figure out.

You can change the language of the website to English in the top right-hand corner, but doing so will immediately default you to English vocals, just so you know. There used to be a bug on the site where changing the language of the vocals also changed the website's language. It seems to be gone, but be wary.

So now that the website is in English, you can get a good look at what the Parameters are. Most are self-explanatory, but Pitch Shift might be something some people don't know. Pitch Shift just moves the whole song up or down by "semitones" (the website uses another name for them: halftones). If you're unfamiliar with them (before you scoff, some people use Singing Synthesizers out of technological curiosity and don't know jack about music), just know that a semitone is one twelfth of an octave. Entering "1" would be like selecting all notes in UTAU and moving them up exactly one space. Entering "12" moves everything up one octave (12 spaces). Entering "24" (the maximum Sinsy allows) moves everything up two octaves (24 spaces).
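If the halftone math feels abstract, here's a minimal Python sketch of it. It assumes the usual MIDI convention where each note number is one halftone apart and C4 is note 60; the function names are made up for illustration and have nothing to do with Sinsy's own code.

```python
def shift_midi_note(note: int, halftones: int) -> int:
    """Move a MIDI note number up (or down) by a number of halftones."""
    return note + halftones


def frequency_ratio(halftones: int) -> float:
    """How much the pitch frequency is multiplied by for a given shift.

    One octave (12 halftones) doubles the frequency.
    """
    return 2.0 ** (halftones / 12.0)


C4 = 60  # middle C under the common "C4 = 60" MIDI convention

print(shift_midi_note(C4, 1))   # one space up
print(shift_midi_note(C4, 12))  # one octave up
print(shift_midi_note(C4, 24))  # two octaves up
print(frequency_ratio(12))      # frequency doubles for one octave
```

So a Pitch Shift of "12" lands every note on the same note name, one octave higher.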

We've acknowledged the website and Parameters. Hooray! For the sake of an example, I'm going to demonstrate via .vsqx import into UTAU, since I feel most people in the Singing Synthesis Community would like to make covers with Sinsy, and gradually become comfortable enough to write an original, but lack the tools.

I'm going to cover an English Song, which means I need a .vsqx that has English Lyrics so I can edit more easily. I've been on a GHOST roll lately, so I'm going to choose a .vsqx of their song called "Happy Days" ft. MAIKA. The .vsqx was made by Grace Herring.
(Note: The song may potentially have upsetting material or ideologies, which is why I didn't link it.)

Though I adore Matsuo, I'm going to use Xiang-Ling. She is terribly scorned for her English mispronunciation, though Matsuo is as well. Now, you may be saying, "Talc, why would you use her if she mispronounces words?" Well, I'm going to show you how to override her mispronunciation. The source of mispronunciation comes from sounds assigned to isolated vowels, homonyms, and homographs. English is complex, and it can be hard to tell which way the word is supposed to be said. Before I get to the cover, I want to introduce the problem we'll run into and how to solve it.

For example: I know that by default, Xiang-Ling will pronounce "wind" as "why-nd" as in, to wrap an object around something or itself, instead of "wih-nd", as in, the movement of air. So what can I do to change this? Phoneme Input.

Now's when I'm going to ask you to open that reference sheet PDF I linked earlier. You may notice that it covers Japanese, English, and Mandarin Phonemes, but for today I am going to focus on English. The first page you see has some very helpful information regarding pronunciation and gives you a visual gist of what we're going to be doing.

The example is off pronunciation-wise, but they do make an important observation: "an answer" is said uniquely. They note that the asterisk only works after a vowel and will cause stress (read: enunciation) of the asterisked vowel.

Without the asterisk, "an answer" would likely be pronounced "ananswer", but that's not how "an answer" is said. English speakers usually stress the "an" of "answer" so that the two words don't blend into a mutter. From a distance or in a low voice, "ananswer" could sound like "announcer", which might cause confusion. To my knowledge the asterisk is there to help with lexical stress, since the CMU dictionary can display it.

Let's have a look at the English phoneme chart on Page 4. I think the consonants are easy to figure out, but some vowels aren't as apparent. I'll give sample words.

aa = box
ae = apple
ax = about (it's the schwa.)
ah = up
ao = caught
aw = flower
ay = fly
eh = ever
er = urge
ey = ape
ih = igloo
iy = tree
ow = ocean
oy = coy
uh = look
uw = food

EDIT: I looked up CMU, and you can combine sounds like "aa" and "r" [aa, r] to get "ar". You can also get "ts" by going [t, s], and so on. If you feel like you'll have difficulty, use the CMU dictionary to look up any word you can't figure out how to write phonetically.
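If you copy a pronunciation straight out of the CMU dictionary, it comes in uppercase with stress digits on the vowels (for example "CH AE1 N S"). Here's a small Python sketch that cleans that up. It assumes Sinsy's English phoneme names are simply the CMU symbols in lowercase without the digits, which matches the chart as far as I can tell but isn't something I can promise. The function name is my own invention.

```python
def cmu_to_phonemes(pronunciation: str) -> list[str]:
    """Turn a CMU dictionary pronunciation into lowercase phoneme names.

    Assumption: the target phoneme names are the CMU symbols lowercased,
    with the stress digits (0, 1, 2) removed from the vowels.
    """
    phonemes = []
    for symbol in pronunciation.split():
        phonemes.append(symbol.rstrip("012").lower())
    return phonemes


print(cmu_to_phonemes("CH AE1 N S"))  # chance
print(cmu_to_phonemes("S W ER1 L"))   # swirl
print(cmu_to_phonemes("K OY1 L"))     # coil
```

One thing to watch for: CMU has no separate schwa symbol and writes it as an unstressed "AH0", so you may want to swap that for "ax" by hand.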

So, for example, some words would be written out as lyrics on notes like this:
chance[ch, ae, n, s]
swirl[s, w, er, l]
coil[k, oy, l]
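If you have a lot of words to fix, typing the brackets by hand gets old. This little Python sketch builds the lyric text in the same word[phoneme, phoneme] shape as the examples above, so you can paste it onto a note. The helper name is hypothetical, and it only glues text together; it doesn't check that the phonemes are valid for Sinsy.

```python
def phoneme_lyric(word: str, phonemes: list[str]) -> str:
    """Build a lyric in the word[p1, p2, ...] shape used for Phoneme Input."""
    return f"{word}[{', '.join(phonemes)}]"


print(phoneme_lyric("chance", ["ch", "ae", "n", "s"]))
print(phoneme_lyric("wind", ["w", "ih", "n", "d"]))
```

The second line is the "wind" (moving air) case from earlier, spelled with "ih" so it doesn't come out as "why-nd".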

You could also play around with the | function and try:
thing[th, iy|ng]
pack[p, ae|k]
flirt[f, l, er|t]

We have our phonemes and the lyrics to the song, so let's import the .vsqx. What I like to do is immediately export it as a MIDI in UTAU. Other than isolated vowels, I don't know off the top of my head what Xiang-Ling is going to mispronounce, so making edits at this point is like taking shots in the dark. Then, I open CeVIO and import the MIDI. Once it is inside CeVIO, don't mess with it in the Song Editor. If CeVIO recognizes a lyric as invalid (basically any English word), it renders it as a rest inside the MusicXML.
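If you want to check whether your lyrics survived the trip before uploading, you can peek inside the MusicXML file yourself. Here's a rough Python sketch that counts rests and lists the lyrics it finds. It assumes the export follows the standard MusicXML layout (note elements containing either a rest or a lyric with a text child); I haven't verified every detail of what CeVIO writes, and the sample score below is made up for the demo.

```python
import xml.etree.ElementTree as ET

# Tiny made-up score: one rest and one sung note.
SAMPLE = """
<score-partwise version="3.1">
  <part id="P1">
    <measure number="1">
      <note><rest/><duration>4</duration></note>
      <note>
        <pitch><step>C</step><octave>4</octave></pitch>
        <duration>4</duration>
        <lyric><text>happy</text></lyric>
      </note>
    </measure>
  </part>
</score-partwise>
"""


def summarize_notes(musicxml_text: str) -> dict:
    """Count the rests and collect the lyric text of the sung notes."""
    root = ET.fromstring(musicxml_text.strip())
    rests = 0
    lyrics = []
    for note in root.iter("note"):
        if note.find("rest") is not None:
            rests += 1
            continue
        text = note.findtext("lyric/text")
        if text is not None:
            lyrics.append(text)
    return {"rests": rests, "lyrics": lyrics}


print(summarize_notes(SAMPLE))
```

For a real file you would read it from disk first (for example with ET.parse on the file path). If you see far more rests than the song actually has, some lyrics were probably turned into rests along the way.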

Export the MIDI in CeVIO as a MusicXML, and once you have that, you can head right back to Sinsy, choose the file, then hit "Send". It will take a few moments depending on song length (5-7 minutes is the maximum). If there's anything wrong with the file you uploaded, it will tell you what happened.

Once it's ready, you will see a little audio player appear on the page.

With this, you'll be able to listen to the voice (or download and listen) to identify which words were said wrong.

Your edits in UTAU may look a bit bulky, but that's nothing irregular. I've also used Phoneme Input on words that were already said correctly to try to get a more solid, clearer sample, and sometimes it works.

Also, one last thing: Sinsy does support dashes! If you have a sequence like:

[to] [mo] [row] [-]

[row] will be extended just fine, but sometimes Sinsy has a little difficulty extending suffixes that end in consonants. That's only sometimes, though; it's usually pretty good.

The patience comes in with identifying what is said wrong throughout the whole song and correcting it, but oftentimes other Singing Synthesizers mispronounce words too. As it is right now, many Singing Synthesizers are very manual. That's just how it is.

Now that you have an idea of what to do, go for it! :)

Here's a tidbit of what I worked on. I think Sinsy is worth your time! Please give it a go.

Take note that both vocals have a slight accent: some ih/iy sounds may be similar, and consonants may be slight or harsh. Using phoneme input can sometimes fix this.

2 comments:

Andrey Zamaraev, February 19, 2019 at 1:05 PM
Greetings! Just wanna share the result of my own experience with Sinsy. The new music project called Graviteka, featuring vocals fully rendered with Sinsy engine. Your tutorial was helpful in the process too. Thank you!
https://graviteka.bandcamp.com/album/wireframe-vision

Saturday, March 10, 2018