Text categorization, the unsupervised way

Konstantin Glushak
Published in Universal Language · May 23, 2016

Part of the work at Smartling is to have fun, so recently we took part in Smartling’s first company-wide Hackathon. I think we had a cool project and an awesome team, so here is what we did.

Smartling is a cloud translation platform that brings together customers who need translation and translation agencies in a single place. Since we want our clients to have the smoothest experience possible, we sometimes challenge ourselves with hard questions. For example, what is the ideal way for a customer to choose a translator? Can we help them do that? What should the criteria be?

To try to answer these questions, we decided to build a recommendation system as our hackathon project: it would determine the main “topic/category” of the customer’s content and recommend the translators with the most experience in that topic.

From a technical point of view, this problem is called “text categorization”, and it has largely been solved by modern NLP algorithms. One might even call it trivial, because even Naive Bayes works well here (it assumes that each word in a text is independent of the other words, so word order can be disregarded). But to use this algorithm, we would need a tagged database with a category assigned to each text, and that database would need to cover most of the common topics. Unfortunately, we didn’t find such a database, and this is where our task became much more complicated and interesting.

Luckily for us, machine learning techniques have made great progress lately. For those of you who are familiar with them: we used word2vec (to get a vector representation for any word), KNN (to cluster those word vectors into groups), TF-IDF (to give a weight to each word in a text), cosine distance (to measure the distance between two vectors), and some glue code to tie it all together. If you are unfamiliar with these approaches, you’re even more welcome to read on.

Our solution can be divided into two phases: “Labeled Words database preparation” and “decision maker”.

Labeled Words database preparation:

To prepare our Labeled Words database, we used a subset of our Smartling databases containing 200,000,000 words, of which 500,000 were unique. Full processing of the DB took about 1.5 hours on a local Core i5 machine, or about 50 minutes on an x8.large Amazon instance.

One of the most interesting algorithms that we used was word2vec (actually, word2vec is a family of algorithms). It represents each word as a set of numbers (a vector of a predefined size). More formally, the algorithm finds a representation of a word in an N-dimensional space, where the number of dimensions is chosen by a human (usually 100–300). These word spaces have some fascinating properties:

  1. Semantically similar words would be close to each other in this space, while words with different meanings would be far from each other.
  2. Some vector-operations from linear algebra are applicable to this representation and still have semantic sense.
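
For readers who want to see what this looks like in code, here is a minimal training sketch using the gensim library. The parameter values and the toy corpus below are assumptions for illustration, not our exact setup.

    # A minimal word2vec training sketch with gensim (assumed setup, not our exact one).
    from gensim.models import Word2Vec

    # In reality `sentences` iterates over the tokenized, lower-cased 200M-word corpus;
    # the two toy sentences below are only here to make the snippet self-contained.
    sentences = [
        ["smartling", "is", "a", "cloud", "translation", "platform"],
        ["we", "need", "translation", "and", "localization"],
    ]

    word2vec_model = Word2Vec(
        sentences,
        vector_size=200,  # dimensionality of the word space (typically 100-300); `size` in gensim < 4.0
        window=5,         # context window around each word
        min_count=1,      # raise this on a real corpus to drop very rare words
        workers=4,        # parallel training threads
    )

    # In recent gensim versions, similarity queries live on the `.wv` attribute:
    # word2vec_model.wv.most_similar(positive=["smartling"])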

Let me show you some examples. “Synonyms” for the word “Smartling” (words that are closest to Smartling) were:

word2vec_model.most_similar(positive=['smartling'])

    translation, 0.5598170757293701
    multilingual, 0.55828857421875
    memsource, 0.5165267586708069
    localization, 0.5156410932540894
    using translation, 0.5070928931236267
    simpleoctober, 0.5038389563560486
    smartlings, 0.49914202094078064

The “antonyms” for the word “Smartling” were:
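
Presumably this list was obtained from the negative side of the same gensim call, something along the lines of:

word2vec_model.most_similar(negative=['smartling'])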

    opposite, 0.3299219608306885
    seemed, 0.3262189030647278
    scribbled, 0.32137537002563477
    chilling, 0.3178943693637848
    love…i, 0.31303125619888306
    drinking, 0.2875804305076599
    deteriorated, 0.2847713530063629
    persuasively, 0.28190189599990845
    ecmilesexpire, 0.27841582894325256

So, it’s not always easy to interpret :).

Here are some more examples of the properties of the algorithm, taken from http://deeplearning4j.org/word2vec.html. There, the words closest to “Sweden” turn out to be its Scandinavian and Northern European neighbours.

As for the vector operations:

vec(Italy) - vec(Rome) + vec(Beijing) ≈ vec(China)

vec(queen) - vec(king) + vec(man) ≈ [woman, Attempted abduction, teenager, girl]
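
In gensim, this kind of vector arithmetic is expressed through the same most_similar call. Here is a sketch for the Italy/Rome/Beijing analogy above (token casing and vocabulary depend on how a given model was trained, so treat this as illustrative):

    # vec(Italy) - vec(Rome) + vec(Beijing): positive terms are added, negative ones subtracted
    word2vec_model.most_similar(positive=['italy', 'beijing'], negative=['rome'])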

And a few more fun examples:

  • Geopolitics: Iraq - Violence = Jordan
  • Distinction: Human - Animal = Ethics
  • President - Power = Prime Minister
  • Library - Books = Hall
  • Analogy: Stock Market ≈ Thermometer

Clustering

So, once we could represent our words in a vector space, we applied a common clustering technique to the set of unique word vectors. K-Nearest Neighbours (KNN) separated all the words into 100 clusters, and those clusters became our labels.

As KNN is a well-known algorithm, I won’t describe how it works here.

At this step we created the Labeled Words database we had been missing. Of course, those labels didn’t have human-readable names, but it was still enough for similarity comparison.
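
Here is a sketch of what this clustering step could look like in code. The post refers to the algorithm as KNN; partitioning a vocabulary into a fixed number of groups is typically done with k-means, so the sketch below uses scikit-learn’s KMeans as an assumption, with variable names of our own.

    # Cluster the unique word vectors into 100 groups; the cluster ids become our labels.
    import numpy as np
    from sklearn.cluster import KMeans

    vocab = list(word2vec_model.wv.index_to_key)                    # all unique words (gensim 4.x attribute)
    word_vectors = np.array([word2vec_model.wv[w] for w in vocab])

    # 100 clusters assumes the full 500,000-word vocabulary; a toy corpus would need fewer clusters.
    kmeans = KMeans(n_clusters=100, random_state=0)
    cluster_ids = kmeans.fit_predict(word_vectors)

    # The "labeled words database": every word mapped to an (unnamed) cluster label.
    word_to_cluster = dict(zip(vocab, cluster_ids))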

Decision maker:

Once we had a cluster number for any given word in our dataset, we needed to determine the weight of each word in a text, for the obvious reason that the word “Messi” tells you much more about the topic of an article than the word “the” or “sense”. To do that we used TF-IDF.

TF-IDF

This algorithm measures how important a given word is to a text. It relies on the assumption that the less common a word is across all the texts in the DB, and the more often it appears in the given text, the more important that word is for the text.
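
As an illustration, per-word TF-IDF weights can be computed with scikit-learn roughly like this (a sketch; we don’t know the exact implementation that was used, and `texts` is just a placeholder corpus):

    # Words that are rare across the corpus but frequent within a document get the highest weights.
    from sklearn.feature_extraction.text import TfidfVectorizer

    texts = ["first article about football and messi",
             "second article about cloud translation"]      # placeholder corpus

    vectorizer = TfidfVectorizer(lowercase=True)
    tfidf_matrix = vectorizer.fit_transform(texts)           # shape: (n_articles, n_unique_words)
    vocabulary = vectorizer.get_feature_names_out()          # column index -> word (get_feature_names() in older sklearn)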

Since we had 100 word clusters, for each new article we created a vector whose i-th element shows how strongly the article belongs to the i-th cluster. As a result, each article was represented by a vector of 100 real numbers.
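
Presumably this 100-dimensional vector is built by summing the TF-IDF weights of the article’s words per cluster. Here is a sketch of that idea, reusing the (assumed) vectorizer and word_to_cluster objects from the snippets above:

    import numpy as np

    def article_to_cluster_vector(article_text, vectorizer, word_to_cluster, n_clusters=100):
        """Turn an article into a 100-dimensional 'how strongly it belongs to each cluster' vector."""
        weights = vectorizer.transform([article_text])       # TF-IDF weights for this single article
        vocabulary = vectorizer.get_feature_names_out()

        cluster_vector = np.zeros(n_clusters)
        for column in weights.nonzero()[1]:
            cluster_id = word_to_cluster.get(vocabulary[column])
            if cluster_id is not None:                        # skip words the word2vec model never saw
                cluster_vector[cluster_id] += weights[0, column]
        return cluster_vector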

We could have stopped here and assigned a single cluster to each article, but it is quite common for a single article to belong to several clusters (topics), so the comparison was made on whole vectors.

We had two options for comparing vectors: Euclidean distance and cosine distance. We went with the latter, because then we wouldn’t need to normalize the vectors by their length (not that this would be a problem). Basically, cosine distance is based on the angle between two vectors: the smaller the angle, the closer the vectors.
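
With two such cluster vectors in hand, the cosine distance can be computed directly, for example with scipy (a sketch; `article_a` and `article_b` are placeholders for raw article texts):

    from scipy.spatial.distance import cosine

    # cosine() returns 1 - cos(angle): 0 for vectors pointing the same way, larger values for dissimilar ones.
    distance = cosine(article_to_cluster_vector(article_a, vectorizer, word_to_cluster),
                      article_to_cluster_vector(article_b, vectorizer, word_to_cluster))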

Results evaluation

Since we built our database with unsupervised learning, it is hard to evaluate the results formally, but we did a “field” comparison using random news articles from the Google News feed.

    article pair        cosine distance (lower = more similar)
    tech vs migrants    0.139902945449
    tech vs films       0.107041635505
    tech vs crime       0.129078335919
    tech vs tech        0.0573367725147
    migrants vs films   0.0687836514904
    migrants vs crime   0.0641768366374
    migrants vs tech    0.114785947853
    films vs crime      0.0270321456571

We think these results are pretty solid. The system found the “migrants”, “crime” and “films” articles to be quite similar to each other, which makes sense for “migrants” and “crime”, since both can occur in articles about problems with the law; our “films” article, however, was about Sherlock Holmes… The two tech articles also end up close to each other, which is exactly what you would expect.
