In Natural Language Processing (NLP), and more particularly in Natural Language Understanding (NLU), named entities play a very important role. Indeed, as I explained in my article “How to choose between NLP and NLU for a chatbot?”, intents and named entities allow machines, through NLU processes, to understand the meaning of statements that humans make orally or in writing in natural language. What are named entities, and why are they a real challenge in NLU?
Named Entity Recognition (NER) is the automatic recognition and extraction of words or groups of words that refer to people, companies, organizations, places, dates, times, zip codes, postal or e-mail addresses, etc. Most named entities are numbers or proper names. In short, named entities are core elements of a statement.
Named Entity Recognition is divided into two phases:
– Named Entity Resolution,
– Named Entity Disambiguation.
Named Entity Resolution is the first phase of Named Entity Recognition. It involves identifying the named entities in a statement, and it works well: most tools available on the market obtain good, if not excellent, results. The real challenge with named entities in NLU does not lie in Named Entity Resolution but in disambiguation, the second phase of Named Entity Recognition.
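To make this first phase concrete, here is a minimal sketch that spots a few of the entity types listed above (e-mail addresses, times, zip codes) with regular expressions. The patterns, the labels and the `find_entities` function are assumptions made for illustration; the tools on the market rely on much richer models, and this is not how any particular product works.

```python
import re

# Hypothetical, deliberately simple patterns (illustration only).
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "ZIP_CODE": re.compile(r"\b\d{5}\b"),   # assumes 5-digit codes
    "TIME": re.compile(r"\b\d{1,2}:\d{2}\b"),
}


def find_entities(statement):
    """Return (label, text, start) tuples sorted by position in the statement."""
    found = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(statement):
            found.append((label, match.group(), match.start()))
    return sorted(found, key=lambda item: item[2])


entities = find_entities("Write to anna@example.com before 17:30, ZIP 75001.")
for label, text, start in entities:
    print(label, text, start)
```

Note that this only finds where the entities are and what type they have. It says nothing about which real-world address or place they refer to.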
What is Named Entity Disambiguation?
It is simply a matter of identifying the exact referent of each named entity in a statement. Let’s take the example of a toponym (the scientific term for a place name: cities, countries, continents, rivers, mountains, etc.): “Paris”. Most people will deduce that the referent of this entity is the city of Paris in France. This deduction is perfectly legitimate, since the city of Paris in France is the best known and therefore the most popular referent of this toponym. However, there are various possible referents for the toponym “Paris”: Paris in Texas, USA; Paris in Ontario, Canada; Paris in Panama; Paris in Togo, etc. What is more, the entity “Paris” could very well have referred to a person if I hadn’t specified, in the context of the example, that it was a city. Thus, the referent of a named entity is not always the most obvious one.
Now how do you identify the right referent for a named entity in a statement?
It all depends on how the tool operates and on the resources it is based on. Operating methods differ widely: some tools randomly select a referent from the list of possible referents, others choose the most popular one from the list, some look for linguistic clues in the context to choose a referent from the list, others assume that all the entities contained in the same statement are close to each other (especially for toponyms, from a geographical point of view), etc.
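Three of these operating methods can be sketched in a few lines: choosing the most popular referent, looking for clues in the context, and choosing the referent geographically closest to another entity in the statement. Everything in this sketch is an assumption made for illustration: the mini list of candidates, the popularity scores, the clue words and the function names are invented, and the coordinates are rounded.

```python
import math
import re

# Hypothetical candidate list for one toponym (invented scores and clue words).
CANDIDATES = {
    "Paris": [
        {"id": "Paris, France", "lat": 48.86, "lon": 2.35,
         "popularity": 0.95, "clues": {"france", "seine", "louvre"}},
        {"id": "Paris, Texas, USA", "lat": 33.66, "lon": -95.56,
         "popularity": 0.04, "clues": {"texas", "usa"}},
        {"id": "Paris, Ontario, Canada", "lat": 43.20, "lon": -80.38,
         "popularity": 0.01, "clues": {"ontario", "canada"}},
    ],
}


def most_popular(candidates):
    """Strategy 1: always pick the best-known referent."""
    return max(candidates, key=lambda c: c["popularity"])


def by_context(candidates, statement):
    """Strategy 2: pick the referent with the most clue words in the statement,
    falling back to the most popular one when no clue is found."""
    words = set(re.findall(r"\w+", statement.lower()))
    best = max(candidates, key=lambda c: len(c["clues"] & words))
    return best if best["clues"] & words else most_popular(candidates)


def distance_km(a, b):
    """Great-circle (haversine) distance between two points, in kilometres."""
    lat1, lon1, lat2, lon2 = map(
        math.radians, (a["lat"], a["lon"], b["lat"], b["lon"]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371 * math.asin(math.sqrt(h))


def by_proximity(candidates, anchor):
    """Strategy 3: pick the referent closest to an already resolved entity."""
    return min(candidates, key=lambda c: distance_km(c, anchor))


paris = CANDIDATES["Paris"]
dallas = {"lat": 32.78, "lon": -96.80}  # assumed to be resolved already
print(most_popular(paris)["id"])
print(by_context(paris, "We drove from Dallas to Paris, Texas.")["id"])
print(by_proximity(paris, dallas)["id"])
```

On the statement about Dallas, the popularity strategy still answers Paris in France, while the context and proximity strategies both point to Paris in Texas. This is why the choice of operating method matters so much.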
The types of resources used to disambiguate named entities are very varied: databases, ontologies, dictionaries, thesauri, etc. These resources are referred to as “external” because they are not part of the statement that contains the entity to be disambiguated. Some tools rely on external resources, others on context, and others on a combination of the two. Because the available resources and operating methods vary so much, the tools differ widely in performance and results.
However, the right combination of resources and operating method can give a Named Entity Disambiguation tool excellent results on the statements it has to process. The knowledge behind the tool, and therefore its resources, must be relevant: for example, a tool built on resources from the Renaissance will not perform as well on a 21st-century statement about new technologies as on a Renaissance statement that matches its resources.
It is essential to choose a Named Entity Recognition tool according to your needs and use cases. You need to make sure that the resources the tool is based on and its operating method are relevant, particularly when it comes to disambiguation.
Named Entity Recognition is therefore a rapidly expanding and constantly improving field. This pursuit of performance matters all the more because more and more chatbots rely on the named entities contained in statements to respond to users or perform automated actions, as Hubi does through its scenarios.
What about Hubi?
It is in the “scenarios” part of Hubi that named entities have a major impact. Hubi’s named entity resolution algorithms build on and integrate those developed by Microsoft. To disambiguate the named entities identified in scenarios, Hubi relies on Microsoft data as well as on internally designed resources.