References

Semantic technologies and information retrieval in SNCF prescriptive documentation

Reference

Coralie Reutenauer, Luce Lefeuvre, Aurélie Fouqueray, Thibault Prouteau, Valentin Pelloin, Nathalie Camelin, Nicolas Dugué, Cédric Lopez, Frédérique Segond, Didier Bourigault (2020) Technologies sémantiques et accès à l’information dans le prescrit SNCF, 22e Congrès de Maîtrise des Risques et Sûreté de Fonctionnement λµ22, to appear.

Abstract

In order to improve documentation production and information retrieval, and thereby rail safety, several prototypes based on Natural Language Processing technologies were developed, applied to SNCF prescriptive documentation, and assessed.

Extraction of tasks in e-mails: a semantic role-based approach

Reference

Melissa Mekaoui, Guillaume Tisserant, Mathieu Dodard, Cédric Lopez (2020), Extraction de tâches dans les e-mails : une approche fondée sur les rôles sémantiques, EGC’2020, p. 193-204.

Abstract

In 2019, around 1.4 billion e-mails were sent every day in France (293 billion worldwide). E-mails significantly increase the volume of communications in companies. As a result, it is difficult for employees to read all messages in order to identify the tasks to be carried out. The first systems for identifying tasks in e-mails appeared at the end of the 1990s. Much work has been done on this topic, based on machine learning, symbolic methods, and hybrid methods. Two approaches are commonly adopted: 1) classification of speech acts (at the message or sentence level), and 2) information extraction based on linguistic patterns. We propose and experiment with an approach based on event extraction (from Information Extraction) and Semantic Role Labeling to identify and structure tasks in e-mails. The evaluation of our system on professional e-mails shows the relevance of our proposal.
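
As a rough illustration of the idea (not the system described in the paper), the following sketch shows how semantic roles extracted from an e-mail sentence could be mapped onto a structured task; the role labels, the `Task` fields, and the example input are assumptions made for the sketch.

```python
# A minimal sketch (not the authors' implementation) of mapping semantic roles
# extracted from an e-mail sentence onto a structured task.
# The role labels and the Task fields below are illustrative assumptions.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Task:
    action: str                      # predicate expressing what must be done
    agent: Optional[str] = None      # who is asked to do it
    theme: Optional[str] = None      # what the action applies to
    deadline: Optional[str] = None   # temporal constraint, if any

def roles_to_task(roles: dict) -> Task:
    """Turn the output of a semantic role labeller (assumed format) into a Task."""
    return Task(
        action=roles.get("predicate", ""),
        agent=roles.get("agent"),
        theme=roles.get("patient") or roles.get("theme"),
        deadline=roles.get("time"),
    )

# Example: roles hypothetically extracted from
# "Could you send the revised report to Marie by Friday?"
srl_output = {"predicate": "send", "agent": "you",
              "theme": "the revised report", "time": "by Friday"}
print(roles_to_task(srl_output))
```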

Towards a resolution of anaphoric relationships in mediated electronic communication

Reference

Hani Guenoune, Cédric Lopez, Guillaume Tisserant, Mathieu Lafourcade, and Melissa Mekaoui (2019) Vers une résolution des relations anaphoriques dans la communication électronique médiée, Actes du colloque Jeunes Chercheurs PRAXILING, p. 139-150

Abstract

The task of coreference resolution applied to written texts consists in finding all the lexical units that refer to the same real-world entities, properties, or situations. With the aim of extracting knowledge from unstructured textual data, this task plays an essential role in the typical natural language processing pipeline. The efficiency of automated resolution systems depends, on the one hand, on the resources they rely on and, on the other, on the editorial nature of the text. After an overview of existing work, this paper presents the issues involved in dealing with anaphora in the context of electronic communication, the problems one can encounter, and possible ways of working around them.
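
To make the task definition concrete, here is a toy sketch (not from the paper) that resolves each pronoun to the nearest preceding mention with compatible, hand-assigned gender and number features; the mentions and features are invented, and real systems rely on much richer resources.

```python
# Toy illustration of coreference/anaphora resolution: link each pronoun to
# the nearest preceding non-pronominal mention with matching (invented)
# gender and number features.
mentions = [
    {"text": "Jules Verne", "gender": "m", "number": "sg", "pronoun": False},
    {"text": "the novel",   "gender": "n", "number": "sg", "pronoun": False},
    {"text": "He",          "gender": "m", "number": "sg", "pronoun": True},
    {"text": "it",          "gender": "n", "number": "sg", "pronoun": True},
]

def resolve(mentions):
    links = {}
    for i, mention in enumerate(mentions):
        if not mention["pronoun"]:
            continue
        for candidate in reversed(mentions[:i]):   # nearest antecedent first
            if (not candidate["pronoun"]
                    and candidate["gender"] == mention["gender"]
                    and candidate["number"] == mention["number"]):
                links[mention["text"]] = candidate["text"]
                break
    return links

print(resolve(mentions))   # {'He': 'Jules Verne', 'it': 'the novel'}
```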

Recursive Named Entity Recognition

Reference

Cédric Lopez, Melissa Mekaoui, Kevin Aubry, Guillaume Tisserant, Hani Guenoune, Mathieu Dodard, Jean Bort and Philippe Garnier (2020) Recursive Named Entity Recognition, Advances in Knowledge Discovery and Management, to appear.

Abstract

Named entity recognition (NER) seeks to locate and classify named entities into predefined categories (persons, organizations, brand names, sports teams, etc.). NER is often considered one of the main modules designed to structure a text. We describe our system, which is characterized by 1) the use of limited resources, and 2) the embedding of results from other modules such as coreference resolution and relation extraction. The system operates on the output of a dependency parser and adopts an iterative execution flow that embeds results from other modules. At each iteration, candidate categories are generated and are all considered in subsequent iterations. The main advantage of such a system is that the best candidate is selected only at the end of the process, taking into account all the elements provided by the different modules. Another advantage is that the system does not need a large amount of resources. The system is compared to state-of-the-art academic and industrial systems and obtains the best results.
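
The following sketch is a simplified, hypothetical rendering of the iterative idea described above, not the published system: candidate categories for each mention are accumulated across iterations by toy scorer modules standing in for the dependency-based and coreference components, and the final category is selected only once all modules have contributed.

```python
# Simplified sketch of iterative NER with deferred decision-making.
from collections import defaultdict

def iterative_ner(mentions, scorers, n_iterations=3):
    """mentions: list of strings; scorers: functions returning {category: score}."""
    candidates = defaultdict(lambda: defaultdict(float))
    for _ in range(n_iterations):
        for mention in mentions:
            for scorer in scorers:
                # each module may add or reinforce candidate categories
                for category, score in scorer(mention, candidates).items():
                    candidates[mention][category] += score
    # the best candidate is selected only at the end of the process
    return {m: max(c, key=c.get) for m, c in candidates.items() if c}

# Toy scorers standing in for the real modules (illustrative only)
lexicon = {"SNCF": {"Organization": 1.0}, "Jules Verne": {"Person": 1.0}}
def lexicon_scorer(mention, _):   return lexicon.get(mention, {})
def heuristic_scorer(mention, _): return {"Person": 0.2} if " " in mention else {}

print(iterative_ner(["SNCF", "Jules Verne"], [lexicon_scorer, heuristic_scorer]))
```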

A French text-message corpus: 88milSMS. Synthesis and usage.

Reference

Rachel Panckhurst, Cédric Lopez, Mathieu Roche (2020), A French text-message corpus: 88milSMS. Synthesis and usage. In “Corpus complexes Traitements, standardisation et analyse des corpus de communication médiée par les réseaux”, CORPUS, 21, to appear.

Abstract

In this article, we first briefly summarise the sud4science project and data collection (http://sud4science.org), the ensuing processing and analysis stages, and the resulting corpus, 88milSMS (http://88milsms.huma-num.fr), through a synthesis of quotes and references to previous articles (§ 1). Secondly, we provide a state of the art on research initiatives that use 88milSMS in various domains and frameworks, which will enable future cross-disciplinary insight (§ 2). We then present other usages of the 88milSMS corpus that we identified through surveys (§ 3). Finally, we suggest future paths for textual data collection and analysis.

Detecting Influential Users in Social Networks: Analysing Graph-Based and Linguistic Perspectives

Reference

Kévin Deturck, Namrata Patel, Pierre-Alain Avouac, Cédric Lopez, Damien Nouvel, Ioannis Partalas and Frédérique Segond (2019) Detecting influential users in social networks: Analysing graph-based and linguistic perspectives, Artificial Intelligence for Knowledge Management, p. 113-131.

Abstract

The detection of influencers has met with increasing interest in the artificial intelligence community in recent years for its utility in singling out pertinent users within a large network of social media users. This could be useful, for example, in commercial campaigns that promote a product or a brand to a relevant set of target users. This task is performed either by analysing the graph representation of user interactions in a social network or by measuring the impact of the linguistic content of user messages in online discussions. In this paper, we explore both approaches independently, with a view to hybridisation. We extract structural information to highlight influence within interaction networks and identify linguistic traits of influential behaviour. We then compute a user influence score using centrality measures on the structural information and a machine learning approach on the linguistic features.
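
As an illustration of combining the two signals, here is a hedged sketch (not the model from the chapter): a structural score computed with a standard networkx centrality measure and a hypothetical linguistic score are merged by a simple weighted sum, where the interactions, the linguistic scores, and the weighting are all assumptions.

```python
# Illustrative combination of graph-based and linguistic influence signals.
import networkx as nx

interactions = [("alice", "bob"), ("alice", "carol"), ("dave", "alice"), ("bob", "carol")]
graph = nx.DiGraph(interactions)              # edge u -> v: u replies to / mentions v
structural = nx.in_degree_centrality(graph)   # structural influence signal

# Hypothetical linguistic scores, e.g. from a classifier trained on
# features of influential behaviour in message content.
linguistic = {"alice": 0.7, "bob": 0.4, "carol": 0.2, "dave": 0.1}

# Simple weighted sum of the two signals (the weighting is an assumption).
influence = {u: 0.5 * structural.get(u, 0.0) + 0.5 * linguistic.get(u, 0.0)
             for u in graph.nodes}
print(sorted(influence.items(), key=lambda kv: -kv[1]))
```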

Iterative Named Entity Recognition from a Syntactic Dependency Structure and the NERD Ontology

Reference

Cédric Lopez, Melissa Mekaoui, Kevin Aubry, Jean Bort and Philippe Garnier (2019) Reconnaissance d’entités nommées itérative sur une structure en dépendances syntaxiques avec l’ontologie NERD, Revue des Nouvelles Technologies de l’Information, RNTI-E-35, p. 81-92 (presented at the EGC’19 conference in Metz).

Abstract

Named entity recognition (NER) seeks to locate and classify named entities into predefined categories (persons, organizations, brand names, sports teams, etc.). NER is often considered one of the main modules designed to structure a text. In this article, we describe our symbolic system, which is characterized by 1) the use of limited resources, and 2) the embedding of results from other modules such as coreference resolution and relation extraction. The system operates on the output of a dependency parser and adopts an iterative execution flow that embeds results from other analysis blocks. At each iteration, candidate categories are generated and are all considered in subsequent iterations. The advantage of such a system is that the best candidate is selected only at the end of the process, in order to take into account all the elements provided by the different modules. The system is compared to academic and industrial systems.

Resources

Wikipedia-ner: Download

A corpus developed by Emvista for named entity recognition. It was built from Wikipedia abstracts and consists of 587 abstracts with 3,125 named entities annotated using the BIO encoding and the concepts of the NERD ontology (see the publication for more details, and the sketch below for an illustration of the encoding).
This corpus is released under the Creative Commons CC-BY-NC-SA and LGPL-LR licenses.
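
As a minimal illustration of what BIO encoding with NERD-style concepts looks like, here is a sketch with invented tokens and labels; the actual column layout of the downloadable corpus may differ.

```python
# Invented example of BIO-encoded tokens with NERD-style concepts, plus a
# helper that groups B-/I- tags back into (entity text, concept) spans.
tokens = ["Jules", "Verne", "was", "born", "in", "Nantes", "."]
labels = ["B-Person", "I-Person", "O", "O", "O", "B-Location", "O"]

def bio_to_spans(tokens, labels):
    spans, current, concept = [], [], None
    for token, label in zip(tokens, labels):
        if label.startswith("B-"):
            if current:
                spans.append((" ".join(current), concept))
            current, concept = [token], label[2:]
        elif label.startswith("I-") and current:
            current.append(token)
        else:
            if current:
                spans.append((" ".join(current), concept))
            current, concept = [], None
    if current:
        spans.append((" ".join(current), concept))
    return spans

print(bio_to_spans(tokens, labels))  # [('Jules Verne', 'Person'), ('Nantes', 'Location')]
```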

Le tour du monde en quatre-vingts jours, by Jules Verne, 1872: Download

This corpus, in WML format, was initially annotated and disseminated by LIFAT, with 12 named entity types (persons, organizations, locations, vessels, facilities, oronyms, …). With the agreement of LIFAT, we propose a new version of this corpus in CSV format with a projection onto NERD ontology types (place, person, organization, product, …); in total, 6,076 tokens are annotated with this ontology. This corpus is released under the Creative Commons CC-BY-NC-SA and LGPL-LR licenses.
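
The projection mentioned above can be pictured as a simple type-mapping table; the sketch below is illustrative only, and the actual mapping used for the distributed corpus may differ.

```python
# Illustrative projection of original annotation types onto NERD ontology types.
TYPE_PROJECTION = {
    "person": "Person",
    "organization": "Organization",
    "location": "Place",
    "oronym": "Place",      # assumption: mountain names projected onto Place
    "facility": "Place",    # assumption
    "vessel": "Product",    # assumption
}

def project(original_type: str) -> str:
    """Map an original entity type to a NERD-style type (default: Thing)."""
    return TYPE_PROJECTION.get(original_type.lower(), "Thing")

print(project("vessel"))    # Product
```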

SMILK, linking natural language and data from the web

Reference

Cédric Lopez, Molka Tounsi Dhouib, Elena Cabrio, Catherine Faron-Zucker, Fabien Gandon, Frédérique Segond (2018) SMILK, trait d’union entre langue naturelle et données sur le web, Revue d’Intelligence Artificielle, vol. 32/3, p. 287-312

Abstract

As part of the SMILK Joint Lab, we studied the use of Natural Language Processing to: (1) enrich knowledge bases and link data on the web, and conversely (2) use this linked data to improve text analysis and the annotation of textual content, and to support knowledge extraction. The evaluation focused on brand-related information retrieval in the field of cosmetics. This article describes each step of our approach: the creation of ProVoc, an ontology for describing products and brands; the automatic population of a knowledge base, mainly based on ProVoc, from heterogeneous textual resources; and the evaluation of an application that takes the form of a browser plugin providing additional knowledge to users browsing the web.
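
As a hedged sketch of the kind of knowledge-base population described above, the following snippet uses rdflib to add brand and product triples extracted from text; the namespace, class names, and property names are hypothetical stand-ins, not the actual ProVoc vocabulary.

```python
# Sketch of populating a knowledge base with brand/product triples via rdflib.
from rdflib import Graph, Literal, Namespace, RDF

EX = Namespace("http://example.org/provoc-like#")   # hypothetical namespace, not ProVoc
g = Graph()
g.bind("ex", EX)

# Suppose the text-analysis pipeline extracted a brand and one of its products.
brand, product = EX["AcmeCosmetics"], EX["HydraCream"]
g.add((brand, RDF.type, EX.Brand))        # EX.Brand, EX.Product, EX.hasBrand and
g.add((product, RDF.type, EX.Product))    # EX.label are invented names for the sketch
g.add((product, EX.hasBrand, brand))
g.add((product, EX.label, Literal("Hydra Cream")))

print(g.serialize(format="turtle"))
```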