

This paper introduces the Bank Question (BQ) corpus, a Chinese corpus for sentence semantic equivalence identification (SSEI). The BQ corpus contains 120,000 question pairs drawn from one year of online bank customer service logs.
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing
The BQ Corpus: A Large-scale Domain-specific Chinese Corpus For Sentence Semantic Equivalence Identification

This paper describes our participation in SemEval 2018 Task 2: Multilingual Emoji Prediction, in which participants are asked to predict the emoji most associated with a tweet from a set of 20 emojis. Instead of regarding it as a 20-class classification problem, we regard it as a text similarity problem and propose a vector similarity based approach for this task. First, a distributed representation (tweet vector) is generated for each tweet; then the similarity between this tweet vector and each emoji's embedding is evaluated. The most similar emoji is chosen as the predicted label. Experimental results show that our approach performs comparably with the classification approach and shows its advantage in classifying emojis with similar semantic meaning.
Proceedings of the 12th International Workshop on Semantic Evaluation
Peperomia at SemEval-2018 Task 2: Vector Similarity Based Approach for Emoji Prediction

The lack of large-scale question matching corpora greatly limits the development of matching methods in question answering (QA) systems, especially for non-English languages. To ameliorate this situation, in this paper we introduce a large-scale Chinese question matching corpus (named LCQMC), which is released to the public. LCQMC is more general than a paraphrase corpus, as it focuses on intent matching rather than paraphrase. The key point in constructing such a corpus is how to collect a large number of question pairs in variant linguistic forms that may present the same intent. We first use a search engine to collect large-scale question pairs related to high-frequency words from various domains, then filter irrelevant pairs by the Wasserstein distance, and finally recruit three annotators to manually check the remaining pairs. After this process, a question matching corpus that contains 260,068 question pairs is constructed. To verify the LCQMC corpus, we split it into three parts, i.e., a training set containing 238,766 question pairs, a development set with 8,802 question pairs, and a test set with 12,500 question pairs, and test several well-known sentence matching methods on it. The experimental results not only demonstrate the good quality of LCQMC but also provide solid baseline performance for further research on this corpus.
Proceedings of the 27th International Conference on Computational Linguistics
LCQMC: A Large-scale Chinese Question Matching Corpus

Proceedings of the 36th Pacific Asia Conference on Language, Information and Computation 2018
From Frying to Speculating: Google Ngram evidence to the meaning development of ‘?’ in Mandarin Chinese

Recent research has brought a wave of computational approaches to the classic topic of semantic change, aiming to tackle one of the most challenging issues in the evolution of human language. While several methods for detecting semantic change have been proposed, such studies are limited to the few languages for which evaluation datasets are available. This paper presents the first dataset for evaluating Chinese semantic change in contexts preceding and following the Reform and Opening-up, covering a 50-year period in Modern Chinese. Following the DURel framework, we collected 6,000 human judgments for the dataset. We also report the performance of alignment-based word embedding models on this evaluation dataset, achieving high and significant correlation scores.
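
The vector similarity approach described in the Peperomia abstract above (build a tweet vector, compare it with each emoji's embedding, pick the most similar emoji) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the tweet vector is a plain average of word vectors, similarity is cosine similarity, and the two-dimensional toy vectors, the emoji names, and the function names `tweet_vector` and `predict_emoji` are hypothetical.

```python
import math

# Toy 2-dimensional embeddings (hypothetical values, for illustration only).
WORD_VECTORS = {
    "love": [1.0, 0.0],
    "you": [0.8, 0.2],
    "sunny": [0.0, 1.0],
    "beach": [0.1, 0.9],
}
EMOJI_VECTORS = {
    "red_heart": [1.0, 0.1],
    "sun": [0.0, 1.0],
}


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def tweet_vector(tokens, word_vectors, dim):
    """Average the vectors of known tokens (an assumed way to build a tweet vector)."""
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        return [0.0] * dim
    return [sum(column) / len(known) for column in zip(*known)]


def predict_emoji(tokens, word_vectors, emoji_vectors):
    """Return the emoji whose embedding is most similar to the tweet vector."""
    dim = len(next(iter(emoji_vectors.values())))
    vec = tweet_vector(tokens, word_vectors, dim)
    return max(emoji_vectors, key=lambda name: cosine(vec, emoji_vectors[name]))


print(predict_emoji(["love", "you"], WORD_VECTORS, EMOJI_VECTORS))
print(predict_emoji(["sunny", "beach"], WORD_VECTORS, EMOJI_VECTORS))
```

Because prediction is a nearest-neighbour lookup in the embedding space rather than a fixed output layer, emojis with similar meanings end up close to each other, which is consistent with the advantage on semantically similar emojis that the abstract reports.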
