In this paper, we present FLAN, a scalable randomized algorithm to clean and canonicalize massive text data. Our algorithm relies on the Jaccard similarity between words to suggest correction results, and we efficiently handle the pairwise word-to-word comparisons via Locality Sensitive Hashing (LSH). We also propose a novel stabilization process to address the issue of hash collisions between dissimilar words, which is a consequence of the randomized nature of LSH and is exacerbated by the massive scale of real-world datasets. Compared with existing approaches, our method is more efficient, both asymptotically and in empirical evaluations, and does not rely on additional features, such as lexical/phonetic similarity or word embedding features. In addition, FLAN does not require any annotated data or supervised learning. We further theoretically show the robustness of our algorithm with upper bounds on the false positive and false negative rates of corrections. Our experimental results on real-world datasets demonstrate the efficiency and efficacy of FLAN.
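The abstract does not spell out FLAN's bucketing scheme, but the general LSH idea it invokes can be sketched with MinHash over character n-grams. Everything below is illustrative: the trigram shingling, the number of hash functions, and the banding parameters are assumptions, not the paper's actual configuration.

```python
import random
from collections import defaultdict

def char_ngrams(word, n=3):
    """Character n-gram set of a word (n=3 is an illustrative choice)."""
    padded = f"#{word}#"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}

def jaccard(a, b):
    """Exact Jaccard similarity between two sets."""
    return len(a & b) / len(a | b)

def minhash_signature(s, seeds):
    """One MinHash value per seeded hash function; two sets agree on an
    entry with probability equal to their Jaccard similarity."""
    return tuple(min(hash((seed, g)) for g in s) for seed in seeds)

def lsh_candidate_pairs(words, num_hashes=8, bands=4):
    """Bucket words by bands of their MinHash signatures so that only
    likely-similar words are compared pairwise, avoiding the quadratic
    all-pairs scan (the core LSH idea, not FLAN's exact scheme)."""
    rng = random.Random(0)
    seeds = [rng.random() for _ in range(num_hashes)]
    rows = num_hashes // bands
    buckets = defaultdict(set)
    for w in words:
        sig = minhash_signature(char_ngrams(w), seeds)
        for b in range(bands):
            buckets[(b, sig[b * rows:(b + 1) * rows])].add(w)
    pairs = set()
    for bucket in buckets.values():
        ordered = sorted(bucket)
        for i in range(len(ordered)):
            for j in range(i + 1, len(ordered)):
                pairs.add((ordered[i], ordered[j]))
    return pairs
```

Words sharing any signature band land in the same bucket, so the exact Jaccard similarity is only computed for candidate pairs rather than all pairs.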
However, real-world text datasets contain a significant amount of spelling errors and improperly punctuated variants, on which the performance of these models quickly deteriorates. Moreover, real-world, web-scale datasets contain hundreds of millions or even billions of lines of text, over which existing text cleaning tools are prohibitively expensive to execute and may require additional overhead to learn the corrections.
Many popular machine learning techniques in natural language processing and data mining rely heavily on high-quality text sources.

Language usage over computer mediated discourses, like chats, emails and SMS texts, significantly differs from the standard form of the language. An urge towards shorter message length, facilitating faster typing, and the need for semantic clarity shape the structure of this non-standard form known as the texting language. In this work we formally investigate the nature and type of compressions used in SMS texts and, based on the findings, develop a word-level model for the texting language. For every word in the standard language, we construct a Hidden Markov Model that succinctly represents all possible variations of that word in the texting language along with their associated observation probabilities. The structure of the HMM is novel and arrived at through linguistic analysis of the SMS data. The model parameters have been estimated from a word-aligned SMS and standard English parallel corpus through machine learning techniques. Preliminary evaluation shows that the word model can be used for decoding texting-language words to their standard counterparts with more than 80% accuracy.
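The per-word generative model described above can be illustrated with a deliberately simplified left-to-right sketch: each character of the standard word is either emitted verbatim or deleted, and decoding picks the standard word that maximizes the probability of the observed token. This is only a toy stand-in for the paper's HMM, whose actual state structure and estimated parameters are not given here; the emission and deletion probabilities below are made up.

```python
def variant_probability(standard, observed, p_emit=0.85, p_del=0.15):
    """Probability that `standard` generates `observed` under a toy
    left-to-right model: each character of the standard word is either
    emitted verbatim (p_emit) or deleted (p_del). Computed by dynamic
    programming over all alignments."""
    n, m = len(standard), len(observed)
    # dp[i][j] = probability of generating observed[:j] from standard[:i]
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    dp[0][0] = 1.0
    for i in range(1, n + 1):
        for j in range(m + 1):
            prob = dp[i - 1][j] * p_del            # delete standard[i-1]
            if j and standard[i - 1] == observed[j - 1]:
                prob += dp[i - 1][j - 1] * p_emit  # emit standard[i-1]
            dp[i][j] = prob
    return dp[n][m]

def decode(observed, lexicon):
    """Map a texting-language token to the most likely standard word,
    scoring every candidate in a (hypothetical) standard lexicon."""
    return max(lexicon, key=lambda w: variant_probability(w, observed))
```

Under this model, `decode("tmrw", ["tomorrow", "today", "teamwork"])` selects `"tomorrow"`, since "tmrw" is obtainable from it by character deletions alone; a full texting-language model would also need states for substitutions and phonetic respellings.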