This is a guide written specifically for obtaining Vocaloid song stems, including where to find them and, if none exist, how to create them.

Clean stems

For the best quality, look for the cleanest stems from official sources, which ultimately means stems from the artists themselves. Here are some of the sources that I would recommend:

  • Vocaloid Collection
    • official stems released by artists for remixing purposes etc.
    • tracker for stem releases from past VocaColle events[1]
  • BMS charts
    • BMS chart packs contain segments of stems as individual samples, spliced by people who have access to the stems
    • use them with bmx2stems to obtain a full stem/multitrack pack
  • “Offvocals”
    • official instrumentals released by artists, sometimes also including stems
    • most Vocaloid related offvocals are hosted on Piapro
    • often linked in music video/audio on artist channels
  • Album rips
    • albums may contain instrumentals of certain songs
  • Project Diva rips
    • Arcade Future Tone HDD rips contain instrumentals and vocals in two different mixes (8-channel ogg files)
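
Once an 8-channel file has been decoded to raw interleaved samples, the channels still need to be pulled apart before they can be used as stems. Here is a minimal sketch of that step. It assumes the channels are grouped as consecutive stereo pairs, which is my assumption and not something the rips document, so check by ear which pair holds what. The function name `split_channel_pairs` is made up for illustration.

```python
def split_channel_pairs(interleaved, channels=8):
    """Split interleaved samples into (left, right) stereo pairs.

    Assumes channels 0+1 form the first pair, 2+3 the second, and so on.
    That grouping is an assumption, not a documented layout.
    """
    if channels % 2:
        raise ValueError("channel count must be even to form stereo pairs")
    if len(interleaved) % channels:
        raise ValueError("sample count is not a multiple of the channel count")

    # One frame holds one sample for every channel.
    frames = [interleaved[n:n + channels]
              for n in range(0, len(interleaved), channels)]

    pairs = []
    for first in range(0, channels, 2):
        left = [frame[first] for frame in frames]
        right = [frame[first + 1] for frame in frames]
        pairs.append((left, right))
    return pairs


# Toy data: two frames of 8 channels, numbered so the result is easy to read.
pairs = split_channel_pairs(list(range(16)))
```

With the toy data above you get four stereo pairs, each two samples long.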

DIY stems

Sometimes only certain parts of the stems are available, and you will have to DIY the remaining ones.

If that is the case, you can use waveform subtraction to obtain stems, which may or may not require further processing with AI to be usable.

Waveform subtraction

With waveform subtraction, full song - instrumental = vocals. Easy!
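
To make the idea concrete, here is a toy sketch of that equation on a handful of samples. It assumes both tracks are already lined up and at the same volume, which real files rarely are. The `subtract` helper is just an illustrative name.

```python
def subtract(full_mix, instrumental):
    """Return full_mix - instrumental, sample by sample."""
    if len(full_mix) != len(instrumental):
        raise ValueError("tracks must have the same number of samples")
    return [m - i for m, i in zip(full_mix, instrumental)]


# Toy data: the "full song" is simply instrumental + vocals.
instrumental = [0.10, -0.20, 0.30, 0.05]
vocals = [0.00, 0.25, -0.10, 0.40]
full_mix = [i + v for i, v in zip(instrumental, vocals)]

recovered = subtract(full_mix, instrumental)
```

In this ideal case `recovered` matches `vocals` up to floating-point rounding. With lossy or differently mastered sources the instrumental does not cancel completely and some of it is left in the result.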

First you need to get the full song and one of the stems, typically the instrumental. Often, though, the full song and the instrumental you can get hold of differ in quality, e.g. they have different sampling rates, or one is lossless (wav/flac) and the other lossy (mp3). Don’t worry, we can still carry on with waveform subtraction; these are just the cases where further processing would be beneficial.
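
If you want a quick check of how two wav files differ before loading them anywhere, something like the sketch below works with Python's standard `wave` module. The helper names are my own, and the in-memory files only stand in for a real full song and instrumental. Note that `wave` only reads uncompressed wav, so flac/mp3 would need converting first.

```python
import io
import wave


def wav_format(file_obj):
    """Read sample rate, bit depth and channel count from a wav file."""
    with wave.open(file_obj, "rb") as w:
        return {
            "rate": w.getframerate(),
            "bits": w.getsampwidth() * 8,
            "channels": w.getnchannels(),
        }


def format_mismatches(a, b):
    """List the properties on which two wav formats disagree."""
    return [key for key in a if a[key] != b[key]]


def make_silent_wav(rate, channels=2, sampwidth=2, frames=10):
    """Build a tiny silent wav in memory (stand-in for a real file)."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(channels)
        w.setsampwidth(sampwidth)
        w.setframerate(rate)
        w.writeframes(b"\x00" * frames * channels * sampwidth)
    buf.seek(0)
    return buf


full_song = wav_format(make_silent_wav(44100))
instrumental = wav_format(make_silent_wav(48000))
differences = format_mismatches(full_song, instrumental)
```

Here `differences` lists only the sample rate, which tells you one of the two files has to be resampled before the tracks can line up sample for sample.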

Load them into an audio editor that you are familiar with and that can zoom in down to the sample level. I generally use Audacity for the task.

Then you need to line the two tracks up to the exact sample and match their amplitude/volume. To sync them, I usually look for a 1 to 2 sample spike that is present in both tracks.
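
The same matching can be sketched in code. This is a simplified illustration, not what any particular tool does: it takes the loudest sample in each track as the shared spike, and it estimates the volume difference with a least-squares gain. All function names are hypothetical. On a real song you would measure the gain on a section where only the instrumental plays (an intro, for example), otherwise the vocals skew the estimate.

```python
def spike_index(samples):
    """Index of the sample with the largest absolute value."""
    return max(range(len(samples)), key=lambda n: abs(samples[n]))


def align_offset(reference, other):
    """Samples by which `other` must be delayed to line up with `reference`."""
    return spike_index(reference) - spike_index(other)


def shift(samples, offset):
    """Delay by padding zeros (offset > 0) or advance by trimming (offset < 0)."""
    if offset >= 0:
        return [0.0] * offset + list(samples)
    return list(samples[-offset:])


def match_gain(reference, other):
    """Least-squares gain g that makes g * other closest to reference."""
    n = min(len(reference), len(other))
    num = sum(reference[k] * other[k] for k in range(n))
    den = sum(other[k] * other[k] for k in range(n))
    return num / den if den else 1.0


# Toy data: the stem starts 2 samples early and is at half the volume.
full_mix = [0.0, 0.0, 0.0, 0.9, 0.1, -0.2, 0.05]
stem = [0.0, 0.45, 0.05, -0.1, 0.025]

offset = align_offset(full_mix, stem)
aligned = shift(stem, offset)
gain = match_gain(full_mix, aligned)
matched = [gain * s for s in aligned]
```

For the toy data the stem needs a delay of 2 samples and a gain of about 2, after which `matched` lines up with `full_mix`.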

Invert one of the tracks, typically the stem track, and mix them down.

The result should have parts that are quieter than in the original tracks. Those are the parts where the two tracks cancelled each other out, which means you have got a track that is cleaner for subsequent processing.
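
If you prefer a number over eyeballing the waveform, you can compare the RMS level of a section before and after subtraction. A rough sketch, with helper names of my own choosing, assuming samples are floats in the -1 to 1 range:

```python
import math


def rms_db(samples):
    """RMS level of a block of samples in dB relative to full scale."""
    if not samples:
        return float("-inf")
    mean_square = sum(s * s for s in samples) / len(samples)
    return 10 * math.log10(mean_square) if mean_square > 0 else float("-inf")


def reduction_db(original, residual):
    """How many dB quieter the residual is than the original section."""
    return rms_db(original) - rms_db(residual)


# Toy data: the residual is one tenth of the original amplitude.
original = [0.5, -0.5, 0.5, -0.5]
residual = [0.05, -0.05, 0.05, -0.05]

drop = reduction_db(original, residual)
```

A tenth of the amplitude works out to a drop of about 20 dB. The larger the drop in an instrumental-only section, the better the two tracks cancelled.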

Some of the stems obtained this way are clean enough to be used directly, but those are few and far between.

AI separation

After performing waveform subtraction, or if you only have the full song, you can use AI music demixers to separate the stems further. Here are some of the best open-source ones you can install and use on your computer:

  • Demucs is good at extracting instrumentals, but is a bit weaker than MDX-net at extracting vocals
    • Demucs v4 includes experimental AI models for extracting guitars and piano, though, as the project repository states, the piano model “doesn’t work so well at the moment”.
  • MDX-net is good for extracting vocals, but does so with a frequency cutoff, i.e. lossy output.
    • UVR developers have trained a vocal model with even higher quality output than the one included in the leaderboard_B branch.
  • Ultimate Vocal Remover is a GUI application that simplifies the process of utilizing AI music separation models. You can use different combinations of one or more AI models (“ensembling”) to get to the results you like.
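
To give a feel for what ensembling means, here is a minimal sketch that blends the outputs of several models with a plain weighted average, sample by sample. This is only the simplest possible way to combine results and is not a description of how Ultimate Vocal Remover implements its ensemble modes. The function name and the toy numbers are made up for illustration.

```python
def ensemble_average(estimates, weights=None):
    """Weighted sample-wise average of several equally long stem estimates."""
    if not estimates:
        raise ValueError("need at least one estimate")
    length = len(estimates[0])
    if any(len(e) != length for e in estimates):
        raise ValueError("all estimates must have the same length")
    if weights is None:
        weights = [1.0] * len(estimates)
    if len(weights) != len(estimates):
        raise ValueError("need exactly one weight per estimate")

    total = sum(weights)
    return [
        sum(w * e[n] for w, e in zip(weights, estimates)) / total
        for n in range(length)
    ]


# Toy data: vocal estimates of the same passage from two different models.
model_a = [0.2, 0.4, -0.1]
model_b = [0.4, 0.0, -0.3]

blended = ensemble_average([model_a, model_b])
```

Passing `weights` lets you lean on the model you trust more for a given song, e.g. `ensemble_average([model_a, model_b], weights=[2, 1])`.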

If you do not have a beefy computer that can handle AI separation, you can take a look at mvsep, which provides AI music separation services free of charge.

Separating backing vocals/chorus, though, is a bit hit or miss. MDX-net can keep more of the backings in the vocals track, but ultimately the AI models aren’t trained with backing vocals in the lead vocals track, so your mileage may vary.

If you want to research different kinds of AI music separation models further, you can browse the MDX21 and MDX23 challenges, which produced a bunch of good open-source models. You can compare or even ensemble them to find a mix to your liking.

Useful tools

Here are some useful tools, some of which I have already mentioned in the guide and some of which I haven’t:

  • Audacity (free)
    • the utility knife in audio processing
    • more granular control in matching up the samples and volume
    • has been in some controversies previously; use Audacity v2 or the Tenacity/Saucedacity forks if you don’t like what the Audacity team did
  • Utagoe (free)
    • simple tool to perform waveform subtraction
    • I believe it can automatically line up the source tracks by sample
    • does not support volume leveling though
    • supports only 16-bit wav files
    • English-translated version available from the UVR team
  • Ultimate Vocal Remover (free)
    • self-hosted all-in-one AI demixing tool with a GUI for using different AI models
  • mvsep (free)
    • provides AI demixing services for free in case you can’t host the tools yourself
  • Mixed In Key (paid)
    • auto detection of key and BPM
    • some official license keys were leaked for MIK 8 back when they were searchable on search engines like Google

  1. There are more stems, including some that are hard to find, in the “Extras” tab.