adesso Turkey Blog



Natural Language Processing (NLP) is a subfield of Artificial Intelligence that aims to bridge the communication gap between computers and humans and to perform tasks that require understanding human language.

NLP is considered a difficult problem in computer science. It’s the nature of the human language that makes NLP difficult. Comprehensively understanding the human language requires understanding both the words and how the concepts are connected to deliver the intended message.

While humans can easily master a language, the ambiguity and imprecision of natural languages are what make NLP difficult for machines.

In this post we present a history of NLP, interleaved with milestones from artificial intelligence and deep learning.

Early 1900s

A Swiss linguistics professor named Ferdinand de Saussure died, and in the process, almost deprived the world of the concept of "Language as a Science". Professor Saussure offered three courses at the University of Geneva, where he developed an approach describing languages as "systems." Within the language, a sound represents a concept – a concept that shifts meaning as the context changes.

Mid 1930s

Georges Artsrouni created an automatic bilingual dictionary based on a paper tape system. Petr Troyanskii proposed an automatic bilingual dictionary together with a scheme for coding interlingual grammatical roles (based on Esperanto).


Warren Weaver proposed applying cryptographic techniques to translation, drawing on Claude Shannon’s information theory, statistics, and speculations about universal principles underlying natural languages.


Alan Turing wrote a paper describing a test for a "thinking" machine. He stated that if a machine could be part of a conversation through the use of a teleprinter, and it imitated a human so completely there were no noticeable differences, then the machine could be considered capable of thinking.


The Hodgkin-Huxley model showed how the brain uses neurons in forming an electrical network.


A Russian-English machine translation system was publicly demonstrated in New York. The system, a collaboration between IBM and Georgetown University, caused a great deal of public interest and much controversy.


Noam Chomsky published his book, "Syntactic Structures". He revolutionized previous linguistic concepts, concluding that for a computer to understand a language, the sentence structure would have to be changed. With this as his goal, Chomsky created a style of grammar called Phrase-Structure Grammar, which methodically translated natural language sentences into a format that is usable by computers.


The programming language LISP (short for LISt Processor), a computer language still in use today, was released by John McCarthy.


ELIZA, a "typewritten" comment and response process, designed to imitate a psychiatrist using reflection techniques, was developed. (It did this by rearranging sentences and following relatively simple grammar rules, but there was no understanding on the computer’s part.)

The STUDENT problem-solving system, created by Daniel G. Bobrow and programmed in LISP, accepts as input a comfortable but restricted subset of English that can express a wide variety of algebra story problems.

The U.S. National Research Council (NRC) created the Automatic Language Processing Advisory Committee, or ALPAC, for short. This committee was tasked with evaluating the progress of Natural Language Processing research.


The NRC and ALPAC initiated the first AI and NLP stoppage, by halting the funding of research on Natural Language Processing and machine translation.

After twelve years of research and $20 million, machine translations were still more expensive than manual human translations, and there were still no computers that came anywhere near being able to carry on a basic conversation. Artificial Intelligence and Natural Language Processing (NLP) research was considered a dead end by many (though not all).


The AI stoppage had initiated a new phase of fresh ideas, with earlier concepts of machine translation being abandoned, and new ideas promoting new research, including expert systems. The mixing of linguistics and statistics, which had been popular in early NLP research, was replaced with a theme of pure statistics.

As a result of both the steady increase in computational power and the shift to Machine Learning algorithms, researchers have increasingly focused on statistical models. These statistical models are capable of making soft, probabilistic decisions.


IBM was responsible for the development of several successful, complicated statistical models.


The popularity of statistical models for Natural Language Processing analyses rose dramatically. The pure statistics NLP methods have become remarkably valuable in keeping pace with the tremendous flow of online text.

N-grams have become useful for recognizing and numerically tracking clumps of linguistic data.
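To make the idea of tracking "clumps" of text numerically more concrete, here is a minimal sketch that extracts and counts word bigrams. The helper name `ngrams` and the toy sentence are assumptions made for illustration only.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return every contiguous window of n tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Toy sentence, invented for this example.
tokens = "the cat sat on the mat and the cat slept".split()

# Count how often each pair of adjacent words (bigram) occurs.
bigram_counts = Counter(ngrams(tokens, 2))
print(bigram_counts.most_common(1))  # [(('the', 'cat'), 2)]
```

The same function yields trigrams or longer windows by changing `n`; real systems would add proper tokenization and smoothing on top of raw counts.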


The idea of multi-task learning was first proposed by Rich Caruana. Multi-task learning encourages the models to learn representations that are useful for many tasks. This is particularly useful for learning general, low-level representations, to focus a model’s attention or in settings with limited amounts of training data.


LSTM recurrent neural net (RNN) models were introduced.


Multi-task learning was applied to road-following and pneumonia prediction by Rich Caruana.

From 2001 to 2007

Neural language models era.


Neural language models started to become popular. Language modelling is the task of predicting the next word in a text given the previous words. It is probably the simplest language processing task with concrete practical applications such as intelligent keyboards and email response suggestion.

Yoshua Bengio and his team proposed the first neural "language" model, using a feed-forward neural network. The feed-forward neural network describes an artificial neural network that does not use connections to form a cycle. In this type of network, the data moves only in one direction, from input nodes, through any hidden nodes, and then on to the output nodes. The feed-forward neural network has no cycles or loops, and is quite different from the recurrent neural networks.
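The task itself, predicting the next word from the previous ones, can be illustrated without any neural network. The sketch below is a count-based stand-in rather than a neural model: it estimates the probability of the next word from the single previous word. The function names and the tiny corpus are assumptions for illustration.

```python
from collections import Counter, defaultdict

def train_bigram_model(tokens):
    """For every word, count which words follow it in the corpus."""
    following = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        following[prev][nxt] += 1
    return following

def next_word_probs(model, prev):
    """Relative-frequency estimate of P(next word | previous word)."""
    counts = model.get(prev, Counter())
    total = sum(counts.values())
    return {word: c / total for word, c in counts.items()} if total else {}

# Toy corpus, invented for this example.
corpus = "i like tea . i like coffee . i drink tea .".split()
model = train_bigram_model(corpus)
probs = next_word_probs(model, "i")
print(max(probs, key=probs.get))  # like
```

An intelligent keyboard would offer the highest-probability words as suggestions; a neural language model replaces the count table with a learned function of the context.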
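The one-directional data flow described above can be sketched in a few lines. This is a generic forward pass under stated assumptions (one tanh hidden layer, a softmax output, no bias terms, hand-picked weights), not a reconstruction of Bengio et al.'s model.

```python
import math

def matvec(W, x):
    """Multiply a weight matrix (list of rows) by an input vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def softmax(z):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def feed_forward(x, W_hidden, W_out):
    """Input -> hidden -> output, with no cycles or loops."""
    hidden = [math.tanh(v) for v in matvec(W_hidden, x)]
    return softmax(matvec(W_out, hidden))

# Illustrative weights, chosen by hand rather than learned.
x = [1.0, 0.0, 0.5]                              # 3 input nodes
W_hidden = [[0.2, -0.1, 0.4], [0.7, 0.3, -0.5]]  # 2 hidden nodes
W_out = [[1.0, -1.0], [0.5, 0.5], [-0.3, 0.8]]   # 3 output nodes
probs = feed_forward(x, W_hidden, W_out)
print(probs)
```

In a language model the output layer would have one node per vocabulary word, so `probs` would be a distribution over the next word.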

Dense vector representations of words or word embeddings have been used.


LSTM models found their niche for voice and text processing. Neural net models are considered the cutting edge of NLP research and development for text understanding and speech generation.

From 2008 to 2012

Multi-task learning era.


Multi-task learning is a general method for sharing parameters between models that are trained on multiple tasks. In neural networks, this can be done easily by tying the weights of different layers.

Multi-task learning was first applied to neural networks for NLP by Collobert and Weston. In their model, the look-up tables (or word embedding matrices) are shared between two models trained on different tasks. Sharing the word embeddings enables the models to collaborate and share general low-level information in the word embedding matrix, which typically makes up the largest number of parameters in a model.
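A minimal sketch of this kind of sharing follows. The class names, the two example tasks, and the scoring rule are hypothetical and chosen only to show that both models hold the same look-up table object, so an update made while training one task is visible to the other.

```python
class SharedEmbeddings:
    """A word look-up table that several task models can hold at once."""
    def __init__(self, vocab, dim):
        self.index = {word: i for i, word in enumerate(vocab)}
        self.table = [[0.1 * (i + 1)] * dim for i in range(len(vocab))]

    def lookup(self, word):
        return self.table[self.index[word]]

class TaskModel:
    """Task-specific weights on top of the shared embeddings (hypothetical)."""
    def __init__(self, embeddings, weights):
        self.embeddings = embeddings  # shared between tasks
        self.weights = weights        # private to this task

    def score(self, words):
        # Average the word vectors, then take a dot product with the weights.
        vectors = [self.embeddings.lookup(w) for w in words]
        avg = [sum(col) / len(vectors) for col in zip(*vectors)]
        return sum(a * w for a, w in zip(avg, self.weights))

embeddings = SharedEmbeddings(["cat", "dog"], dim=2)
tagger = TaskModel(embeddings, weights=[1.0, 0.0])      # e.g. a tagging task
recognizer = TaskModel(embeddings, weights=[0.0, 1.0])  # e.g. an entity task

before = recognizer.score(["cat"])
# Simulate an update to the embedding of "cat" made while training the tagger.
tagger.embeddings.lookup("cat")[1] += 0.5
after = recognizer.score(["cat"])
print(before, after)
```

Because the table is shared by reference, nothing has to be copied between the two models; in a real network the same effect comes from both losses sending gradients into one embedding matrix.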


Feed-forward neural networks were replaced with recurrent neural networks (RNNs; Mikolov et al., 2010) and long short-term memory networks (LSTMs; Graves, 2013) for language modelling.


Apple’s Siri became known as one of the world’s first successful NLP/AI assistants to be used by general consumers. Within Siri, the Automated Speech Recognition module translates the owner’s words into digitally interpreted concepts.


Word embeddings

The main innovation that was proposed by Mikolov et al. was to make the training of these word embeddings more efficient by removing the hidden layer and approximating the objective. While these changes were simple in nature, they enabled—together with the efficient word2vec implementation—large-scale training of word embeddings.

Word2vec comes in two flavours: continuous bag-of-words (CBOW) and skip-gram. They differ in their objective: CBOW predicts the centre word based on the surrounding words, while skip-gram does the opposite and predicts the surrounding words from the centre word.

Word embeddings can also be learned via matrix factorization (Pennington et al., 2014; Levy & Goldberg, 2014) and with proper tuning, classic matrix factorization approaches like SVD and LSA achieve similar results (Levy et al., 2015).
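The difference between the two objectives is easiest to see in the training examples each one generates. The sketch below only builds those examples; it does not train embeddings and does not use the word2vec implementation. The function names and the window size are assumptions for illustration.

```python
def cbow_examples(tokens, window=1):
    """CBOW: (surrounding words) -> centre word."""
    examples = []
    for i, centre in enumerate(tokens):
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        if context:
            examples.append((context, centre))
    return examples

def skipgram_examples(tokens, window=1):
    """Skip-gram: centre word -> each surrounding word separately."""
    return [(centre, word)
            for context, centre in cbow_examples(tokens, window)
            for word in context]

tokens = ["the", "quick", "brown", "fox"]
print(cbow_examples(tokens)[1])      # (['the', 'brown'], 'quick')
print(skipgram_examples(tokens)[1])  # ('quick', 'the')
```

CBOW produces one example per position with the whole context as input, while skip-gram produces one example per (centre, neighbour) pair.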

RNNs and CNNs both treat the language as a sequence. From a linguistic perspective, however, language is inherently hierarchical: Words are composed into higher-order phrases and clauses, which can themselves be recursively combined according to a set of production rules. The linguistically inspired idea of treating sentences as trees rather than as a sequence gives rise to recursive neural networks.


Sequence-to-sequence models

2013 and 2014 marked the time when neural network models started to get adopted in NLP. Three main types of neural networks became the most widely used: recurrent neural networks, convolutional neural networks, and recursive neural networks.

Sutskever et al. proposed sequence-to-sequence learning, a general framework for mapping one sequence to another using a neural network. In the framework, an encoder neural network processes a sentence symbol by symbol and compresses it into a vector representation; a decoder neural network then predicts the output symbol by symbol based on the encoder state, taking as input at every step the previously predicted symbol.


Memory-based and Attention-based networks

Attention can be seen as a form of fuzzy memory where the memory consists of the past hidden states of the model, with the model choosing what to retrieve from memory.

Memory-based networks come in different variants, such as Neural Turing Machines (Graves et al., 2014), Memory Networks (Weston et al., 2015), End-to-end Memory Networks (Sukhbaatar et al., 2015), and Dynamic Memory Networks (Kumar et al., 2015).

Neural networks have demonstrated the ability to directly learn to produce such a linearized output given a sufficient amount of training data for constituency parsing (Vinyals et al., 2015).

Attention is one of the core innovations in neural MT (NMT) and the key idea that enabled NMT models to outperform classic phrase-based MT systems (Bahdanau et al., 2015). The main bottleneck of sequence-to-sequence learning is that it requires compressing the entire content of the source sequence into a fixed-size vector. Attention alleviates this by allowing the decoder to look back at the source sequence hidden states, which are then provided as a weighted average as additional input to the decoder.

Different forms of attention-based models were introduced (Luong et al., 2015). Attention is widely applicable and potentially useful for any task that requires making decisions based on certain parts of the input. It has been applied to constituency parsing (Vinyals et al., 2015) and reading comprehension (Hermann et al., 2015), among other tasks.
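The weighted average mentioned above can be sketched as follows. For brevity this example scores each source hidden state with a plain dot product, which is an assumption of the sketch and simpler than the learned scoring function of Bahdanau et al.; the vectors are invented for illustration.

```python
import math

def attention(query, encoder_states):
    """Return attention weights and the weighted average of encoder states."""
    # Score each source hidden state against the decoder state (dot product).
    scores = [sum(q * h for q, h in zip(query, state)) for state in encoder_states]
    # Softmax turns the scores into weights that sum to 1.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # The context vector is the weighted average of the hidden states.
    dim = len(encoder_states[0])
    context = [sum(w * state[d] for w, state in zip(weights, encoder_states))
               for d in range(dim)]
    return weights, context

# Three source hidden states and one decoder state (toy values).
encoder_states = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
decoder_state = [0.0, 2.0]
weights, context = attention(decoder_state, encoder_states)
print(weights.index(max(weights)))  # 1
```

The decoder receives `context` as extra input at each step, so it no longer depends only on a single fixed-size summary of the source sentence.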


Machine translation turned out to be the killer application of sequence-to-sequence models.

Due to its flexibility, sequence-to-sequence learning is the go-to framework for natural language generation tasks, with different models taking on the role of the encoder and the decoder. Importantly, the decoder model can not only be conditioned on a sequence, but on arbitrary representations. Sequence-to-sequence learning can even be applied to structured prediction tasks common in NLP where the output has a particular structure. For simplicity, the output is linearized, as can be seen for constituency parsing.

Google announced that it was starting to replace its monolithic phrase-based MT models with neural MT models (Wu et al., 2016). Google’s Jeff Dean has reported that 500 lines of TensorFlow code have replaced 500,000 lines of code in Google Translate.

Named entity recognition (NER) was introduced as another application of neural networks that learn to produce a linearized output from training data (Gillick et al., 2016).

The Differentiable Neural Computer was introduced (Graves et al., 2016).

The Neural Programmer-Interpreter (NPI) was introduced by Google DeepMind (Reed et al., 2016).
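As an illustration of what linearizing a structured output means, the sketch below flattens a small constituency tree into a bracketed token sequence that a decoder could emit symbol by symbol. The tuple-based tree format and the choice to keep the words in the output are simplifications assumed for this example.

```python
def linearize(tree):
    """Flatten a nested (label, child, ...) tree into a list of tokens."""
    if isinstance(tree, str):
        return [tree]                 # a leaf is just a word
    label, *children = tree
    tokens = ["(" + label]            # opening bracket carries the label
    for child in children:
        tokens.extend(linearize(child))
    tokens.append(")" + label)        # closing bracket repeats the label
    return tokens

# Toy parse tree, invented for this example.
tree = ("S", ("NP", "John"), ("VP", ("V", "has"), ("NP", "a", "dog")))
print(" ".join(linearize(tree)))
# (S (NP John )NP (VP (V has )V (NP a dog )NP )VP )S
```

Once the tree is a flat sequence, parsing can be treated like translation: the input sentence is the source sequence and the bracketed string is the target.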


Even Bengio et al.’s classic feed-forward neural network is in some settings competitive with more sophisticated models as these typically only learn to consider the most recent words (Daniluk et al., 2017).

In multi-task learning, the sharing of parameters is typically predefined, but different sharing patterns can also be learned during the optimization process, as shown by Ruder et al.

The Recurrent Entity Network was introduced (Henaff et al., 2017).


Pretrained language models

The classic LSTM remains a strong baseline (Melis et al., 2018).

The 2008 paper by Collobert and Weston proved influential beyond its use of multi-task learning. It spearheaded ideas such as pretraining word embeddings and using convolutional neural networks (CNNs) for text, which have only been widely adopted in recent years. It won the test-of-time award at ICML 2018.

Multi-task learning is gaining in importance, and dedicated benchmarks for multi-task learning have been proposed (Wang et al., 2018; McCann et al., 2018).


ALBERT: A Lite BERT for Self-Supervised Learning of Language Representations

Author: Reza Ebrahimi

Senior Machine Learning Engineer


Ferdinand de Saussure. (2020). Retrieved 19 February 2020, from

History of machine translation. (2020). Retrieved 19 February 2020, from

Granell, X. (2014). Multilingual Information Management (1st ed.). Chandos Publishing.

Hodgkin and Huxley: Superheroes. (2020). Retrieved 19 February 2020, from

Hutchins, W. (2004). The Georgetown-IBM Experiment Demonstrated in January 1954. Machine Translation: From Real Users To Research, 102-114.

Chomsky, N., & Lightfoot, D. (2002). Syntactic structures (2nd ed.). Mouton de Gruyter.

defmacro - The Nature of Lisp. (2006). Retrieved 19 February 2020, from

Landsteiner, N. (2005). Eliza (elizabot.js). Retrieved 19 February 2020, from

Bobrow, D. (1964). Natural Language Input for a Computer Problem Solving System. MIT. Retrieved from

Leonard-Barton, D., & Sviokla, J. (1988). Putting Expert Systems to Work. Harvard Business Review. Retrieved 19 February 2020, from

Kumar, P. (2020). An Introduction to N-grams: What Are They and Why Do We Need Them?. XRDS. Retrieved 19 February 2020, from

Sebastian Ruder (2017). An Overview of Multi-Task Learning in Deep Neural Networks. arXiv preprint arXiv:1706.05098.

Le, J. (2019). Recurrent neural networks: The powerhouse of language modeling. Built In. Retrieved 20 February 2020, from

Understanding LSTM Networks - colah's blog. (2015). Retrieved 20 February 2020, from

Rich Caruana at Microsoft Research. Microsoft Research. (2020). Retrieved 20 February 2020, from

Ruder, S. (2018). A Review of the Neural History of Natural Language Processing - AYLIEN. AYLIEN. Retrieved 20 February 2020, from

Sebastian Ruder. (2017). Multi-Task Learning Objectives for Natural Language Processing. [online] Available at: [Accessed 20 Feb. 2020].

Collobert, R., & Weston, J. (2020). Retrieved 20 February 2020, from

Reed, S., & de Freitas, N. (2016). Neural Programmer - Interpreters. In ICLR. United Kingdom.

Henaff, M., Weston, J., Szlam, A., Bordes, A., & LeCun, Y. (2017). Tracking the World State with Recurrent Entity Networks. Retrieved 20 February 2020, from

Representations, A. (2019). ALBERT: A Lite BERT for Self-Supervised Learning of Language Representations. Google AI Blog. Retrieved 20 February 2020, from
