Natural Language Processing
Natural language processing (NLP) is a branch of artificial intelligence. It covers the understanding and the generation of language, ‘as a human would do’, from the perspective of a particular use: machine translation, speech recognition, chatbots, sentiment analysis, or text ranking, which is the application covered in talk4.
Each of these applications uses different algorithmic techniques, but some basic concepts are relevant to all of them:
Tokenization and normalization:
These are two steps performed before processing a text or sentence. Tokenization consists of cutting a text into smaller components: a text is split into sentences, and a sentence is split into words, so that they can be processed more easily later.
Normalization consists of cleaning up sentences (punctuation, abbreviations, digits, etc.) so that numerical modeling can be applied.
During these preprocessing stages, one also removes the ‘stop-words’, for example articles, which add nothing to the logic of the sentence.
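As an illustration, these preprocessing steps can be sketched in a few lines of Python; the regular expressions and the stop-word list are simplified assumptions, not a production pipeline:

```python
import re

# Illustrative (far from exhaustive) stop-word list.
STOP_WORDS = {"the", "a", "an", "of", "to", "and"}

def tokenize(text):
    """Split a text into sentences, then each sentence into word tokens."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [re.findall(r"[a-zA-Z']+", s) for s in sentences]

def normalize(tokens):
    """Lowercase tokens and drop stop-words and empty strings."""
    return [t.lower() for t in tokens if t and t.lower() not in STOP_WORDS]

text = "The cat sat on a mat. It was happy!"
cleaned = [normalize(s) for s in tokenize(text)]
print(cleaned)  # [['cat', 'sat', 'on', 'mat'], ['it', 'was', 'happy']]
```

Real pipelines would also handle abbreviations, digits, and accents, as described above; the point here is only the split-then-clean structure.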
Stemming or Lemmatization:
These two techniques consist of finding the root of a word. Stemming merely strips prefixes, suffixes, conjugated endings, etc., whereas lemmatization focuses on finding the root word behind a derived form (e.g. ‘better’ vs. ‘good’).
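A toy contrast between the two approaches, assuming a hand-made suffix list and a tiny dictionary of irregular forms (real systems use algorithms such as Porter stemming and full lemma dictionaries):

```python
# Illustrative suffix list and lemma table; not a real stemmer/lemmatizer.
SUFFIXES = ("ing", "ed", "ly", "es", "s")
LEMMAS = {"better": "good", "ran": "run", "mice": "mouse"}

def stem(word):
    """Strip the first matching suffix, keeping at least 3 characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def lemmatize(word):
    """Look up an irregular form; otherwise fall back to suffix stripping."""
    return LEMMAS.get(word, stem(word))

print(stem("jumping"))      # "jump"
print(lemmatize("better"))  # "good"
```

Note how the lemma table is what lets lemmatization map ‘better’ to ‘good’, something no suffix rule can do.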
Corpus:
A corpus is a set of texts used for training or testing. A corpus can be thematic, or representative of a linguistic subset, on which learning will then be performed.
Language modeling: formal or numerical
The idea is to represent a word or a sentence by a numerical ‘value’, so that mathematical or statistical treatments can then be carried out to produce the expected results. Several types of modeling have been explored by NLP researchers, depending on the type of application. For simplicity, there are two main schools:
- Modeling the relations between words, represented formally in the form of graphs.
- Modeling characters, words, or texts in the form of vectors and multidimensional matrices.
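The second school can be illustrated with a bag-of-words sketch: each text becomes a count vector over a shared vocabulary (the example documents are made up):

```python
from collections import Counter

def build_vocabulary(documents):
    """Collect the sorted set of all words seen in the corpus."""
    return sorted({word for doc in documents for word in doc.split()})

def vectorize(document, vocabulary):
    """Turn a document into a count vector aligned with the vocabulary."""
    counts = Counter(document.split())
    return [counts[word] for word in vocabulary]

docs = ["the cat sat", "the dog sat on the cat"]
vocab = build_vocabulary(docs)
vectors = [vectorize(d, vocab) for d in docs]
print(vocab)    # ['cat', 'dog', 'on', 'sat', 'the']
print(vectors)  # [[1, 0, 0, 1, 1], [1, 1, 1, 1, 2]]
```

Once texts are vectors, standard linear-algebra and statistical machinery (distances, matrix factorizations, classifiers) applies directly.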
Machine learning:
It is an artificial intelligence program, based on numerical optimization techniques, which, having identified a ‘pattern’ from the data analyzed during a learning phase, is able to apply it to new data of the same nature, never encountered before.
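A toy illustration of this idea: fitting the pattern y ≈ 2x by gradient descent on a few made-up points, then applying the learned parameter to an input never seen during learning (the data and learning rate are illustrative):

```python
# Made-up training data, roughly following y = 2x.
data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.0)]

w = 0.0    # model parameter to learn
lr = 0.05  # learning rate (assumed, not tuned)
for _ in range(200):
    # Gradient of the mean squared error of y = w * x with respect to w.
    grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
    w -= lr * grad

print(round(w, 2))   # close to 2.0
print(w * 10.0)      # prediction on an input never seen during learning
```

The loop is the "learning"; the final line is the program applying the identified pattern to new data of the same nature.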
Supervised learning and unsupervised learning:
These are the methods used to train machine learning programs. Learning is said to be supervised when it is applied to a set of data selected because they are characteristic of what one seeks to reproduce. It is unsupervised when learning is done on data about which nothing is known a priori.
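The distinction can be sketched on one-dimensional toy data: a supervised nearest-neighbor classifier that uses labels, versus an unsupervised k-means-style grouping that must discover the clusters itself (the data and initialization are illustrative):

```python
# Supervised: labeled examples -> classify a new point by nearest neighbor.
labeled = [(1.0, "small"), (1.2, "small"), (9.0, "large"), (9.5, "large")]

def classify(x):
    """Return the label of the closest labeled example."""
    return min(labeled, key=lambda pair: abs(pair[0] - x))[1]

print(classify(1.1))  # "small"

# Unsupervised: unlabeled points -> group into 2 clusters around 2 means.
points = [1.0, 1.2, 9.0, 9.5]
m1, m2 = points[0], points[-1]  # initial cluster centers
for _ in range(10):             # a few k-means style updates
    c1 = [p for p in points if abs(p - m1) <= abs(p - m2)]
    c2 = [p for p in points if abs(p - m1) > abs(p - m2)]
    m1, m2 = sum(c1) / len(c1), sum(c2) / len(c2)

print(sorted([m1, m2]))  # the two discovered cluster centers
```

In the supervised half the labels "small"/"large" are given in advance; in the unsupervised half the program only finds that the points form two groups, without knowing what the groups mean.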