
Scores are not absolute, for two reasons. First, the choice of expressions was arbitrary. Note also that GT alters its vocabulary choice on the fly, so the words chosen to translate a phrase in isolation may not be those selected when the same phrase appears in a longer sentence. For example, “run of the mill” is rendered in French as “course du moulin” in isolation and as “course de l’usine” when translating a longer tweet (both completely wrong), and other instances might produce other results.

We did not test single words, which can be highly ambiguous (for example, right = correct, legal entitlement, politically conservative, not left, etc.) and are therefore too arbitrary in isolation. This is a real omission: although GT does not advertise itself as a dictionary, single-word lookups are a major proportion of real-world uses of the service. Nor did we test full sentences, which add a lot of complexity to scoring, since translations might include both correct and incorrect elements. Moreover, humans can translate the same sentence in many ways, making it impossible to impose a gold standard for full sentences, especially across dozens of languages.

We tested clusters of two or more words that often occur together and whose meaning is generally discernible when they do (as can be readily seen in Twitter searches). For example, tweets with “out cold” almost always imply unconsciousness.
