How Blokur is matching music metadata in Asian characters

As it becomes easier to access music from all over the world, processing song data rendered in different character sets has become increasingly important.

The music industry already has problems with metadata accuracy due to the need to combine data from so many different sources, and adding the complexity of processing non-Latin characters only exacerbates the issue.

What makes matching Asian characters such a challenge?

As described in a previous Blokur blog post, matching in the world of music rights means consolidating information in order to link related song data from different databases. At present, much of this information is represented using the Latin alphabet.

Writing systems that use logograms, such as Simplified and Traditional Chinese Han characters and Japanese Kanji, present difficulties when processing song data. This is because each logogram represents a whole word or unit of meaning rather than an individual sound, whereas languages such as English, Hindi and Korean use writing systems in which letters represent consonants or vowels. When combined, these letters create syllables that form words. For example, in Hindi, the word “radio” is pronounced “re-di-yo” and spelt रे (re) डि (di) यो (yo) = रेडियो.

Challenge 1) Same characters, different tones

Example showing two pronunciations for the Chinese character 行

In Chinese dialects, a single character can have several pronunciations and tones. For example, as shown in the image above, the character “行” can be pronounced in two different ways, xíng or háng, each with a different meaning. These pronunciations can also change depending on the context and dialect.

Example of how Chinese song titles are typically represented in databases

The image above is an example of how databases often represent Chinese titles. Many are stored in block capitals with no way of marking tones, which causes problems when trying to link romanised titles to the correct Chinese characters.
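To illustrate how tone information is lost, here is a minimal Python sketch that turns pinyin with tone marks into the block-capital form described above. The function name `strip_tone_marks` is made up for this example, and this is not a description of how any particular database produces its titles.

```python
import unicodedata


def strip_tone_marks(pinyin: str) -> str:
    """Return pinyin in block capitals with its tone diacritics removed."""
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", pinyin)
    # Drop the combining marks (Unicode category "Mn") and keep everything else.
    base = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return base.upper()


print(strip_tone_marks("xíng"))  # XING
print(strip_tone_marks("yǒu"))   # YOU
print(strip_tone_marks("yòu"))   # YOU
```

Two syllables that differ only in tone end up as the same string, which is why a toneless title cannot be mapped back to a single set of characters without further context.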

Challenge 2) Different writing scripts (Simplified vs Traditional Chinese)

Simplified Chinese vs Traditional Chinese: Example showing the traditional and simplified Chinese characters for “red” (紅 hóng) and “around” (圍 wéi)

Chinese Han characters come in two forms, traditional and simplified. Every character exists in the traditional form, but about a third of them also have a simplified form. Simplified characters have fewer strokes, which makes them easier to read and write.

  • Simplified Chinese characters are used in mainland China, Malaysia and Singapore.

  • Traditional Chinese characters are used in Taiwan, Hong Kong and Macau.

The challenge is that, depending on the region a song comes from, a Mandarin or Cantonese title can be written in either Simplified or Traditional Chinese characters.

Example:

  • A Mandarin song titled “guò lái” (Come Here) would be represented as “過來” in Taiwan.

  • However, a song with the same title from the Mandarin-speaking regions of mainland China would be written as “过来”.

When matching song titles, it’s crucial to factor in the different character representations a romanised title could have in order to create an accurate match.
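One common way to handle the two scripts is to fold both onto a single form before comparing. The sketch below assumes a tiny hand-written Traditional-to-Simplified table covering only the characters used in this article; the names `TRAD_TO_SIMP`, `fold_to_simplified` and `same_title` are hypothetical, and a real system would need a complete conversion table.

```python
# Illustrative table only: Traditional -> Simplified for characters in this article.
TRAD_TO_SIMP = str.maketrans({"過": "过", "來": "来", "紅": "红", "圍": "围"})


def fold_to_simplified(title: str) -> str:
    """Replace every Traditional character found in the table with its Simplified form."""
    return title.translate(TRAD_TO_SIMP)


def same_title(a: str, b: str) -> bool:
    """Compare two titles after folding both onto Simplified characters."""
    return fold_to_simplified(a) == fold_to_simplified(b)


print(fold_to_simplified("過來"))    # 过来
print(same_title("過來", "过来"))    # True
```

Folding in one direction keeps the comparison simple, because characters that are already Simplified pass through unchanged.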

Challenge 3) Chinese, a language family with different dialects that use the same characters but different transliterations

Sinitic Languages — Map of common Chinese dialect groups including Mandarin, Cantonese (Yue), Hakka, Wu, Min and others.

Chinese isn’t a singular monolithic language but rather an umbrella term for many related dialects with different degrees of similarity. For example, two well-known Chinese variants are Mandarin and Cantonese (Yue).

  • Mandarin is spoken in many areas of mainland China, Taiwan and countries such as Singapore and Malaysia.

  • Cantonese (Yue) is the dominant dialect in Hong Kong, the southern region of mainland China and Macau.

Another challenge is that the same romanised text can point to different characters, with different meanings, depending on the dialect. For example, a database entry containing “JI” could stand for 寄 (jì, “to send”) if the title is in Mandarin from the mainland. In Cantonese from Hong Kong, however, “JI” could stand for 意 (ji3, “meaning”).
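The “JI” ambiguity can be made concrete with a small lookup. The Python sketch below groups readings by their toneless, block-capital form; the names `toneless_key`, `READINGS` and `build_index` are invented for illustration, and the two readings are the ones quoted in this article.

```python
import re
import unicodedata
from collections import defaultdict


def toneless_key(romanised: str) -> str:
    """Strip pinyin tone marks and jyutping tone digits, then upper-case."""
    decomposed = unicodedata.normalize("NFD", romanised)
    no_marks = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    no_digits = re.sub(r"[1-6]", "", no_marks)  # jyutping tones are numbered 1 to 6
    return no_digits.upper()


# (character, romanisation system, reading) taken from the examples in the text.
READINGS = [("寄", "pinyin", "jì"), ("意", "jyutping", "ji3")]


def build_index(readings):
    """Map each toneless key to every character reading that produces it."""
    index = defaultdict(list)
    for char, system, reading in readings:
        index[toneless_key(reading)].append((char, system, reading))
    return index


index = build_index(READINGS)
print(index["JI"])  # [('寄', 'pinyin', 'jì'), ('意', 'jyutping', 'ji3')]
```

A single key returning several candidates is the situation a matcher has to resolve using other information about the song.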

Transliterations (pinyin, jyutping and others)

Example showing the romanisation of the Chinese character for ‘man’ in Pinyin (rén), Jyutping (jan4) and Hakka Sixian, PFS/Pha̍k-fa-sṳ (ngìn)

The image above is an example of transliteration, the process by which text in one writing script is represented phonetically in another. “Pinyin” is the name of the romanisation system for Mandarin. In pinyin, tones are represented using accent marks.

Example:

  • The character “世” (world), when pronounced in Mandarin, is transliterated as “shì”.

  • In “jyutping”, a Cantonese romanisation system, the same character is transliterated as “sai3”. Jyutping uses numbers to indicate tones.

These are two of many methods used to represent Chinese dialects using the Latin script.

As mentioned above, what makes this challenging is that many databases represent Chinese song titles in block capitals with no tone markings, making it hard to work out which characters they represent or to translate them.

Take, for example, the track “一无所有” by celebrated Chinese rock musician Cuī Jiàn. In a typical database, this song title could be represented as “YI WU SUO YOU”. However, the last character, “有”, could present three pinyin options: “yǒu”, “yòu” and “wěi”. It could also be the Cantonese jyutping pronunciation “jau5”. Deciphering which transliterated option the character represents in the title often requires more advanced processing.

Pioneer of Chinese rock music, Cuī Jiàn (1992)

Song titles can also be represented differently depending on whether they have been translated or transliterated. Using the Cuī Jiàn example mentioned above, the song title could be represented in five ways:

  1. “一无所有”: The Mandarin title written in Simplified Han Chinese characters.

  2. “一無所有”: The Mandarin title written in Traditional Han Chinese characters.

  3. “Yī wú suǒ yǒu”: The title written in Mandarin Chinese “pinyin”.

  4. “Jat1 mou4 so2 jau5”: The title written in Cantonese Chinese “jyutping”.

  5. “Nothing to My Name”: The title translated into English words.

In music data, this causes problems when songs need to be matched. If two databases contain the same songs, but some titles are written in Han Chinese characters and others in pinyin, jyutping or English translation, it becomes challenging to connect them.
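As a rough illustration of connecting these forms, the Python sketch below reduces each known form of a title to a comparison key and checks whether an incoming title shares one of them. It is a simplified sketch and not Blokur's matching engine: `match_key`, `KNOWN_FORMS` and `is_same_work` are hypothetical names, the conversion table covers a single character, and a jyutping form could be added by also stripping tone digits.

```python
import unicodedata

# Illustrative table only: the one Traditional character needed for this title.
TRAD_TO_SIMP = str.maketrans({"無": "无"})


def match_key(title: str) -> str:
    """Fold a title to Simplified characters, drop tone marks, upper-case, tidy spaces."""
    folded = title.translate(TRAD_TO_SIMP)
    decomposed = unicodedata.normalize("NFD", folded)
    no_marks = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return " ".join(no_marks.upper().split())


# Forms of the same title that one hypothetical work record might list.
KNOWN_FORMS = ["一无所有", "Yī wú suǒ yǒu", "Nothing to My Name"]
KNOWN_KEYS = {match_key(form) for form in KNOWN_FORMS}


def is_same_work(title: str) -> bool:
    """True if the incoming title reduces to the key of any known form."""
    return match_key(title) in KNOWN_KEYS


print(is_same_work("一無所有"))        # True
print(is_same_work("YI WU SUO YOU"))  # True
print(is_same_work("GUO LAI"))        # False
```

Exact key comparison only works when the record already lists each form; linking a toneless pinyin title to characters that are not yet on record is the harder part of the problem.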

How is Blokur solving this issue?

When music is played, the people who own it should get paid for their product.

However, music is often owned by more than one person or group, and this information is usually spread out in different databases, making it hard to figure out who owns what and how much of it they own.

In order to make sure the right people are paid, the information that exists in these different databases needs to be consolidated. This is where matching comes in.

Blokur’s matching engine combines multiple strategies for processing logographic Asian characters as found in Chinese and Japanese music metadata. This involves using automation to match Chinese song titles written in Han Chinese characters, pinyin or English translations (see diagram below).

Diagram showing three methods Blokur uses to process Chinese song titles

To further improve accuracy, Blokur also incorporates our knowledge base of ISRCs (International Standard Recording Codes), enriched with information from our database of 30 million music compositions. This includes data about artists and their pseudonyms, which leads to more matches with higher accuracy. That in turn helps us ensure that the correct creators are remunerated for their work as early as possible.
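To show why artist and pseudonym data helps, here is a hedged Python sketch in which two catalogue entries share the same toneless title and the artist name decides between them. The `Work` class, the `CATALOGUE` contents (including the made-up “EXAMPLE ARTIST” entry) and `pick_work` are all assumptions for illustration and do not describe Blokur's data model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Work:
    title_key: str             # toneless, block-capital title
    artist_aliases: frozenset  # every name the artist is known by


# Hypothetical catalogue: the second entry is invented to create a title clash.
CATALOGUE = [
    Work("YI WU SUO YOU", frozenset({"CUI JIAN", "崔健"})),
    Work("YI WU SUO YOU", frozenset({"EXAMPLE ARTIST"})),
]


def pick_work(title_key: str, artist: str) -> Optional[Work]:
    """Return the single work whose title and artist alias both match, else None."""
    artist_key = artist.strip().upper()
    matches = [
        work for work in CATALOGUE
        if work.title_key == title_key and artist_key in work.artist_aliases
    ]
    return matches[0] if len(matches) == 1 else None


print(pick_work("YI WU SUO YOU", "崔健") is CATALOGUE[0])  # True
print(pick_work("YI WU SUO YOU", "Unknown Artist"))        # None
```

Returning nothing when the result is not unique keeps an uncertain match from being recorded as a confident one.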

Examples of Blokur’s matches of track titles in Chinese characters to their transliterated version


To find out more about how Blokur can help with processing music databases with Asian characters, get in touch with us.