Tek Capital has already built a Chinese automatic Q&A system based on a short-text similarity algorithm. The company is now looking for an NLP expert who can create an efficient English short-text similarity algorithm based on the Chinese data set provided by the company (we will give you access to the machine translation engine API designed by our company). Each question submitted by a user should be given a score from 0 to 1 according to its similarity to the original set of questions.
You can use K-means clustering (with TensorFlow or Keras) to enhance accuracy, or any other OPEN SOURCE algorithm that achieves excellent results on the standard evaluation metrics of accuracy, precision, recall and F1-score. For example, knowledge-based measures quantify the semantic relatedness of words using a semantic network. Many such measures have been shown to work well on WordNet, a large lexical database of English. In WordNet, nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept. WordNet’s structure makes it a useful tool for computational linguistics and natural language processing. (WordNet is also freely and publicly available for download at [login to view URL])
Work Result Requirements:
(accuracy) = (TP+TN)/(TP+FN+FP+TN) = 95%
(precision) = TP/(TP+FP) = 95%
(recall) = TP/(TP+FN) = 95%
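For reference, the formulas above can be checked with a short Python sketch like the one below. It is only an illustration: the function names and the example counts are hypothetical, and it assumes that similarity scores have already been turned into match / no-match decisions with some threshold, which this posting does not specify. F1 is included because it is named among the evaluation metrics.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion counts.

    tp/fp/fn/tn are the true-positive, false-positive, false-negative and
    true-negative counts. How a score becomes a positive or negative
    decision (the threshold) is an assumption left to the caller.
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


def meets_targets(metrics, target=0.95):
    """Return True if accuracy, precision and recall all reach the target."""
    return all(metrics[name] >= target for name in ("accuracy", "precision", "recall"))


if __name__ == "__main__":
    # Hypothetical counts, chosen only to illustrate the 95% targets.
    example = classification_metrics(tp=95, fp=5, fn=5, tn=95)
    print(example, meets_targets(example))
```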
The script should be written in Python or Java; more details will be shared via private message.
You must have extensive experience with natural language processing and be familiar with popular off-the-shelf word embedding models such as Word2Vec (by Google), GloVe (by Stanford) or fastText (by Facebook), as well as open-source language resources on GitHub, pre-trained multilingual language models (LM) and other related online NLP resources that can be downloaded to our server.
Attached is the Chinese data set; you only need to look at the Questions in the file. The Answers can be ignored, because the goal is to compare new questions submitted by users with the existing questions in the data set and to give each new question a score from 0 to 1 with high accuracy, precision and recall.
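To make the expected input and output concrete, here is a minimal Python sketch of the comparison described above. It is a simple baseline, not the required solution: it swaps in bag-of-words cosine similarity as a stand-in for whatever algorithm is finally chosen, it assumes the questions have already been translated into English, and the function names and sample questions are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(question_a, question_b):
    """Bag-of-words cosine similarity between two questions, in [0, 1]."""
    counts_a = Counter(tokenize(question_a))
    counts_b = Counter(tokenize(question_b))
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[t] * counts_b[t] for t in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    # Clamp to 1.0 to guard against floating-point rounding.
    return min(1.0, dot / (norm_a * norm_b))


def best_match(new_question, known_questions):
    """Return (score, question) for the most similar known question."""
    scored = [(cosine_similarity(new_question, q), q) for q in known_questions]
    return max(scored, key=lambda pair: pair[0])


if __name__ == "__main__":
    # Hypothetical sample questions, not taken from the attached data set.
    known = ["How can I reset my password?", "What are your opening hours?"]
    score, match = best_match("How do I reset my password?", known)
    print(round(score, 3), match)
```

A word-overlap baseline like this ignores synonyms and word order, which is where WordNet-based measures or pre-trained embeddings would be expected to help.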
Reference Article for this project: [login to view URL]@adriensieg/text-similarities-da019229c894