Glossary term matching in Smartling

Pavlo Myrotiuk
Published in Universal Language
5 min read · Oct 30, 2019

What is a Glossary?

A Glossary is a set of words (terms) that need to be handled with care during the translation process (business domain-specific terms, trademarks/mottos, legal terms, etc.). Translations of these terms must be consistent across all text: if different texts (or even parts of one text) are translated by multiple translators, the terms must receive the same translation within the same domain (text, pages).

Problem: find glossary terms in the provided text.

This problem looks pretty simple at first glance. What can be more obvious than searching for a word in text:

Load all glossary terms from the database and match them against the provided text. It should be as simple as:

Pattern p = Pattern.compile("term");
Matcher m = p.matcher("text to search term in");
boolean found = m.find(); // note: matches() requires the whole input to match; find() searches for an occurrence

Smartling’s glossaries are a powerful tool, with lots of config options per term that may complicate patterns significantly. Some examples:

  • `Exact` term means that we want to search for words that match a glossary term exactly. For example (term “Join”):

Should be matched: If I join now, will I get a 10% discount?

Shouldn’t be matched: Joining the military, for most, can be a life-altering decision.

  • Case sensitivity. For example (term “Peachy”):

Should be matched: Peachy Color Palette

Shouldn’t be matched: A peachy shade that’s perfect for summer months.

Usually, if the client doesn’t restrict the search with any of these configuration options, we want to be flexible and suggest possible matches to translators. For example, if we have the glossary term “run”, it should be found in the sentence “He runs very fast”: here “run” and “runs” (third person singular) are forms of the same word.

Combining the two options results in four complex regular expressions:

NON_EXACT_MATCH_CASE_SENSITIVE_TERM_MATCH = "(?:^|\\s|>|\\W)(%s(s|es)*)(?=\\s|$|<|\\W)";
NON_EXACT_MATCH_CASE_INSENSITIVE_TERM_MATCH = "(?i)(?:^|\\s|>|\\W)(%s(s|es)*)(?=\\s|$|<|\\W)";
EXACT_MATCH_CASE_INSENSITIVE_TERM_MATCH = "(?i)(?:^|\\s|>|\\W)(%s)(?=\\s|$|<|\\W)";
EXACT_MATCH_CASE_SENSITIVE_TERM_MATCH = "(?:^|\\s|>|\\W)(%s)(?=\\s|$|<|\\W)";
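To show how such a template is applied: the term is substituted into the pattern with String.format and the text is scanned with find(). This is only a minimal sketch (the class and method names are illustrative, and Pattern.quote is added here for safety against regex metacharacters in terms):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TermMatcher {
    // Non-exact, case-insensitive variant from the templates above
    static final String NON_EXACT_CI = "(?i)(?:^|\\s|>|\\W)(%s(s|es)*)(?=\\s|$|<|\\W)";

    static List<String> findTerm(String term, String text) {
        // Quote the term so characters like "." or "+" are matched literally
        Pattern p = Pattern.compile(String.format(NON_EXACT_CI, Pattern.quote(term)));
        List<String> hits = new ArrayList<>();
        Matcher m = p.matcher(text);
        while (m.find()) {
            hits.add(m.group(1)); // group 1 holds the term plus an optional plural suffix
        }
        return hits;
    }
}
```

For the term “run”, this finds “runs” in “He runs very fast.” but, as designed, not “Running” in “Running late” (the non-exact template only tolerates the -s/-es suffixes).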

Hopefully, I won’t be the engineer stuck debugging and fixing these patterns.

Going to mark these with a comment //don’t touch

Commit, deploy to prod, solved. Ok, that must be it.

…Yeah… almost.

What is a developer’s favorite kind of task?

Right… performance issues.

In some cases, our service started responding with timeouts. That can’t be good…

It turned out that some of our clients have glossaries containing as many as 1 million terms. During a request, the client may need to find terms from multiple glossaries in a single block of text.

After research, we found out that Pattern.compile() is not as efficient as we would like it to be.

Let’s check the usage pattern. Each client usually works with the same set of glossaries over some period of time, but against different texts. OK, let’s add caching… That’s easy: the terms are the same, so compile the patterns once and keep the compiled representations in a cache. Good idea, right? Almost… We are bound by memory limits (the cache can’t be limitless), and our service is under a pretty high load, with multiple clients constantly matching text against their glossaries. By the time client “A” sends its second request, its cache entries have very likely been evicted in favor of client “F”’s, so the cache helps very little.
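For completeness, a bounded LRU cache of compiled patterns, as described above, can be sketched with an access-ordered LinkedHashMap. The class name and size are illustrative, not Smartling’s actual implementation:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

public class PatternCache {
    private final int maxEntries;
    private final Map<String, Pattern> cache;

    public PatternCache(int maxEntries) {
        this.maxEntries = maxEntries;
        // accessOrder=true turns the map into an LRU: the least recently
        // used entry becomes the eldest and is evicted on overflow
        this.cache = new LinkedHashMap<>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Pattern> eldest) {
                return size() > PatternCache.this.maxEntries;
            }
        };
    }

    public synchronized Pattern get(String regex) {
        Pattern p = cache.get(regex);
        if (p == null) {
            p = Pattern.compile(regex); // compile once, reuse on later requests
            cache.put(regex, p);
        }
        return p;
    }

    public synchronized int size() {
        return cache.size();
    }
}
```

The eviction behavior is exactly the problem described above: with enough tenants competing for the same bounded cache, one client’s patterns are gone before its next request arrives.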

What else can be done? Let’s move this logic to the database; after all, DBs are designed for searching and matching. Moving the logic to the DB helped a bit, but not dramatically, and we still saw timeouts from time to time.

We kept looking, and it turned out that a simple String.contains() check worked well enough. It actually solved the problem for a while.

What usually happens next… you deploy, forget about the legacy service, and develop a new service with cutting-edge technologies. No, it doesn’t work that way… usually, a product manager walks up and says “we want unicorns jumping all around.” 😄

The next feature request was: Find Glossary term “ran” (past tense of “run”) in the sentence “I was running through the woods.”

A contains function won’t help here… We need Lemmatization.

There are a few lemmatization libraries, but the most appropriate for our use case turned out to be Stanford’s CoreNLP. It can lemmatize really large texts and tons of glossary terms within a second. Moreover, it splits sentences into words, so you can easily compare a glossary term against words in the text with the equals operation. Beyond that, it can be tuned with dozens of different annotators (think extensions/plugins/configuration): by turning some of them on you can get parts of speech, link entity mentions to Wikipedia entities, and much more.

So we changed our algorithm to:

  1. Lemmatize glossary term. Term “ran” -> lemma: “run”
  2. Split the sentence into words and lemmatize them to get the lemmatized representation of the sentence (a list of lemmas): [“I”, “be”, “run”, “through”, “the”, “wood”]. Beyond the lemmas, this actually produces an object containing the original word, its lemma, and its position in the text.
  3. Filter lemmatized text for the needed glossary term and return it along with a position for the UI to highlight it in a sentence.
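The three steps above can be sketched in plain Java. In production the lemmatization is done by CoreNLP; here a tiny hand-written lemma table stands in for the real lemmatizer, and all names are illustrative:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class LemmaMatcher {
    // Toy stand-in for CoreNLP's lemma annotator; a real lemmatizer
    // covers the whole language rather than a hand-written table.
    static final Map<String, String> LEMMAS = Map.of(
            "ran", "run", "running", "run", "runs", "run",
            "was", "be", "woods", "wood");

    static String lemma(String word) {
        String w = word.toLowerCase();
        return LEMMAS.getOrDefault(w, w);
    }

    /** Returns the token positions where the term's lemma occurs in the sentence. */
    static List<Integer> findTerm(String term, String sentence) {
        String termLemma = lemma(term);             // step 1: "ran" -> "run"
        String[] words = sentence.split("\\W+");    // step 2: split into words
        List<Integer> positions = new ArrayList<>();
        for (int i = 0; i < words.length; i++) {
            if (lemma(words[i]).equals(termLemma)) { // step 3: compare lemmas
                positions.add(i);
            }
        }
        return positions;                            // positions let the UI highlight matches
    }
}
```

With this sketch, the term “ran” is found at token position 2 of “I was running through the woods”, which is exactly the match the feature request asked for.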

This works pretty well. But even when a client tells you something like “all we need is lemmatization, lemmatization is all we need” or “throw everything away and leave only the lemmatization logic”, that might not be what they really expect. The next request you’ll have to resolve is why the glossary term “AI” is not found in “Why AIs Won’t Take Over The World”. And why would it be? From this context it is not 100% clear what “AIs” is. On the other hand, it is matched in “Why deep-learning AIs are so easy to fool”: here we have more context. At the least, “are” suggests that “AIs” is probably a plural form.

Sometimes clients expect us to find the term “quarter” in the sentence “A quarterly newsletter is distributed to members”, but lemmatization won’t help in this case (“quarterly” lemmatizes to “quarterly”, not “quarter”). So we came up with a hybrid solution: regex/substring matching for some glossary terms, and lemmatization for the rest.
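One way to sketch that hybrid: each term carries a flag saying which strategy applies, and the matcher dispatches on it. The enum, class names, and the toy lemmatizer are illustrative, not the actual Smartling code:

```java
import java.util.Locale;

public class HybridMatcher {
    enum Strategy { SUBSTRING, LEMMA }

    // Illustrative stand-in for a real lemmatizer such as CoreNLP
    static String lemma(String word) {
        String w = word.toLowerCase(Locale.ROOT);
        return w.equals("ran") || w.equals("running") || w.equals("runs") ? "run" : w;
    }

    static boolean matches(String term, Strategy strategy, String text) {
        if (strategy == Strategy.SUBSTRING) {
            // catches derived forms like "quarterly" for the term "quarter"
            return text.toLowerCase(Locale.ROOT).contains(term.toLowerCase(Locale.ROOT));
        }
        // LEMMA strategy: compare the term's lemma to each word's lemma
        String termLemma = lemma(term);
        for (String word : text.split("\\W+")) {
            if (lemma(word).equals(termLemma)) {
                return true;
            }
        }
        return false;
    }
}
```

With the SUBSTRING strategy, “quarter” is found inside “quarterly”; with the LEMMA strategy, “ran” is found in “I was running through the woods”.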

That is it,

[“thanks”, “for”, “read”, “this”] (matching score 85% 😃)
