If you are a pub quiz specialist or if you spend all day working with music rights data, you might be able to bring these links to mind. And probably almost any competent human could infer that Billy might be short for William. But what if you are a software program processing data with 300 different William Corgans? How can you safely identify which are the Smashing Pumpkins frontman, which are his father, and which are just similar-sounding names?
This problem is at the core of the challenge of consolidating song registrations from different publishers. And when it goes wrong, money goes missing.
Imagine that you are one of three publishers involved in a song called Got To Be U by John David Smith and Jane Sophie Doe. Each of the three publishers submits a separate registration.
- The first publisher’s registration is for a song called Got To Be You by John Smith and Jane Doe.
- The second is for Got 2 Be U by Johnny Smith and Sophie Doe.
- And your registration is for Got To Be U by JD Smith and JS Doe.
In order to create a complete picture of the rights, we first need to know that all three publishers are talking about the same song. If two of the three registrations are matched but yours isn’t, there’s a good chance that any royalties you are owed are not going to reach you.
The problem is that neither the titles nor any of the songwriter names match exactly across the three registrations. On text matching alone, it’s very unlikely that you will be able to say with confidence that all three publishers are talking about the same song. For this reason, most matching systems rely on title similarity and a lot of manual effort. Our work with publishers suggests that this results in matching failures more than 50% of the time.
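To make the problem concrete, here is a minimal sketch of text-only matching applied to the three registrations above. It uses Python's standard `difflib` module as a stand-in similarity measure. That choice, and the helper name `similarity`, are assumptions made for illustration and not a description of how any real matching system scores titles.

```python
from difflib import SequenceMatcher
from itertools import combinations

# The three registrations from the example above.
registrations = [
    {"title": "Got To Be You", "writers": ["John Smith", "Jane Doe"]},
    {"title": "Got 2 Be U", "writers": ["Johnny Smith", "Sophie Doe"]},
    {"title": "Got To Be U", "writers": ["JD Smith", "JS Doe"]},
]


def similarity(a, b):
    """Illustrative similarity score between 0 and 1 (1 means identical)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


# Exact matching: compare every pair of titles and every pair of writer sets.
title_pairs = list(combinations([r["title"] for r in registrations], 2))
exact_title_matches = [(a, b) for a, b in title_pairs if a == b]

shared_writer_names = [
    set(r1["writers"]) & set(r2["writers"])
    for r1, r2 in combinations(registrations, 2)
]

# Fuzzy matching: every pair of titles is similar, but none is identical.
title_scores = {(a, b): similarity(a, b) for a, b in title_pairs}

for (a, b), score in title_scores.items():
    print(f"{a!r} vs {b!r}: {score:.2f}")
```

Exact comparison finds nothing in common, and the similarity scores only say that the titles are close. Deciding how close is close enough is where the manual effort comes in.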
The power of the graph
Through a research collaboration with the Digital Catapult, Blokur has developed a new graph search technology to tackle exactly this challenge.
First, the system analyses names and titles not in isolation but as a graph of relationships. This gives us greater context for each of the entities based on their position in the graph: the Johnny Smith we’re looking for is connected to a song called Got 2 Be U and a co-writer called Sophie Doe. That’s already a start. It’s harder to confuse our Johnny Smith with one of the many other Johnny Smiths out there who don’t have these relationships.
Then, instead of searching for a match using a text string, we use the graph itself as our query. The system attempts to find a similar graph somewhere else in the data.
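The idea of using a graph as the query can be sketched in a few lines. This is a toy illustration under stated assumptions, not Blokur's implementation. The functions `as_graph`, `graphs_match` and `loosely_similar`, the 0.5 threshold, the use of `difflib` and the "Midnight Train" decoy record are all hypothetical. The name comparison is deliberately loose: the surrounding relationships are what keep the match safe.

```python
from difflib import SequenceMatcher
from itertools import permutations


def loosely_similar(a, b, threshold=0.5):
    """Hypothetical loose text comparison; the threshold is an arbitrary choice."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def as_graph(title, writers):
    """Represent a registration as a tiny graph: a work node linked to writer nodes."""
    return {"work": title, "edges": [(title, "written_by", w) for w in writers]}


def writers_of(graph):
    return [writer for _, _, writer in graph["edges"]]


def graphs_match(query, candidate):
    """True if the candidate has the same shape as the query and every node
    loosely resembles its counterpart. Requiring an equal number of writers
    is a simplification made for this sketch."""
    if not loosely_similar(query["work"], candidate["work"]):
        return False
    q, c = writers_of(query), writers_of(candidate)
    if len(q) != len(c):
        return False
    # Try every pairing of query writers with candidate writers.
    return any(
        all(loosely_similar(a, b) for a, b in zip(q, perm))
        for perm in permutations(c)
    )


query = as_graph("Got 2 Be U", ["Johnny Smith", "Sophie Doe"])

catalogue = [
    as_graph("Got To Be You", ["John Smith", "Jane Doe"]),
    as_graph("Got To Be U", ["JD Smith", "JS Doe"]),
    # Hypothetical decoy: same writer name, unrelated song and co-writer.
    as_graph("Midnight Train", ["Johnny Smith", "Alan Brown"]),
]

matches = [g["work"] for g in catalogue if graphs_match(query, g)]
print(matches)
```

In this sketch the decoy is rejected even though its writer name is an exact text match for Johnny Smith, because neither its title nor its co-writer resembles the query graph.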
And here comes the fun part. Because our graph of relationships makes it highly unlikely that we would confuse our Johnny Smith with somebody else, we can be less strict about how we write his name. We can go looking for every name that looks at all similar to Johnny Smith, including JD Smith and John David Smith, knowing that we’ll only find a false positive in the unlikely event that another writer with a name similar to Johnny Smith also happens to be linked to a songwriter that looks similar to Sophie Doe AND a song title that looks similar to Got 2 Be U. If you can still recall your high school mathematics, you’ll know that multiplying together probabilities in this way makes the system’s defence against false positives extremely robust.
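As a rough worked example of that multiplication, suppose each loose check could be passed by an unrelated record purely by coincidence. The three probabilities below are made-up illustrative values, not measurements, and the calculation assumes the three coincidences are independent of one another.

```python
import math

# Hypothetical chance that an unrelated record passes each loose check by luck.
p_name = 0.05      # some other writer's name resembles "Johnny Smith"
p_cowriter = 0.05  # that writer's co-writer resembles "Sophie Doe"
p_title = 0.02     # and their song title resembles "Got 2 Be U"

# Assuming independence, all three must happen together for a false positive.
p_false_positive = math.prod([p_name, p_cowriter, p_title])

print(f"Any single check: up to {max(p_name, p_cowriter, p_title):.0%}")
print(f"All three together: {p_false_positive:.4%}")
```

With these assumed figures, the combined chance of a false positive is far smaller than the chance of any single check being fooled, which is why each individual comparison can afford to be loose.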
More matches means more revenue
The result of all of this is that Blokur is able to generate more candidates for matches and match them with a greater level of accuracy. That means fewer duplicate works, fewer missed conflicts and more successful matches between works and recordings, all of which add up to more revenue for music publishers.