I need a tool that works similarly to the [login to view URL] pipeline and supports the following languages: PL, DE, IT, EN, UA, FR, CZ.
The tool should process parallel texts in plain-text format, as found in the [login to view URL] repository (more precisely, Moses format). The program should detect the language of each file automatically from its file extension.
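As a rough illustration of the extension-based detection, here is a minimal Python sketch. The function name `detect_language` and the alias table are assumptions made for this example, not part of the requirements. The aliases are there because corpora often use the ISO 639-1 codes `uk` and `cs` for Ukrainian and Czech rather than UA and CZ.

```python
from pathlib import Path

# Language codes taken from the list in this posting.
SUPPORTED_LANGUAGES = {"pl", "de", "it", "en", "ua", "fr", "cz"}

# Assumption: accept ISO 639-1 spellings and map them to the codes used above.
ALIASES = {"uk": "ua", "cs": "cz"}


def detect_language(path):
    """Return the language code encoded in the file extension.

    Example: 'corpus.de-en.pl' is treated as Polish because it ends in '.pl'.
    """
    suffix = Path(path).suffix.lstrip(".").lower()
    code = ALIASES.get(suffix, suffix)
    if code not in SUPPORTED_LANGUAGES:
        raise ValueError(f"Unsupported or missing language extension: {path!r}")
    return code
```

Rejecting unknown extensions early, instead of guessing, keeps a mislabeled file from being tokenized with the wrong language model.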
The tool should have the following capabilities, executed one after another in exactly this order:
Step 1: Lowercase the whole text (optional, disabled by default).
Step 2: Pre-clean the text (optional, enabled by default). We want to use the [login to view URL] scripts, i.e. [login to view URL], [login to view URL], [login to view URL], [login to view URL]. If you find anything else essential, please suggest it.
Step 3: Normalize punctuation marks (optional, enabled by default). We want to use the same tool as here: [login to view URL], i.e. [login to view URL]
Step 4: Tokenization. It should be performed with SpaCy, and for Polish with SpaCy-pl [login to view URL]
Step 5: Truecasing (optional, enabled by default). You can use a fragment of [login to view URL], because the whole thing comes from [login to view URL] anyway.
I don't have any pre-trained models. I want the models to be trained on the input data and then applied to that same data, just as it is done in Moses.
Step 6: Split words into subword units with the BPE algorithm [login to view URL] (optional, enabled by default with a vocabulary of 50,000 entries). The vocabulary size must be adjustable through a parameter.
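To make the defaults and the fixed order of the steps concrete, here is a minimal sketch of a command-line interface built with Python's standard `argparse` module. All flag names (`--lowercase`, `--no-clean`, `--bpe-size`, and so on) and the step labels are hypothetical choices for this example. The steps themselves are not implemented here, the sketch only shows which ones would run.

```python
import argparse


def build_parser():
    """Build a parser whose defaults mirror the step list above."""
    parser = argparse.ArgumentParser(
        description="Parallel-corpus preprocessing (illustrative sketch)."
    )
    parser.add_argument("inputs", nargs="+", help="input files, one per language")
    # Step 1 is opt-in; steps 2, 3, 5 and 6 are opt-out.
    parser.add_argument("--lowercase", action="store_true",
                        help="lowercase the text (disabled by default)")
    parser.add_argument("--no-clean", dest="clean", action="store_false",
                        help="skip pre-cleaning")
    parser.add_argument("--no-normalize", dest="normalize", action="store_false",
                        help="skip punctuation normalization")
    parser.add_argument("--no-truecase", dest="truecase", action="store_false",
                        help="skip truecasing")
    parser.add_argument("--no-bpe", dest="bpe", action="store_false",
                        help="skip BPE segmentation")
    parser.add_argument("--bpe-size", type=int, default=50000,
                        help="BPE vocabulary size (default: 50000)")
    return parser


def enabled_steps(args):
    """Return the names of the steps to run, in the required order."""
    steps = []
    if args.lowercase:
        steps.append("lowercase")
    if args.clean:
        steps.append("clean")
    if args.normalize:
        steps.append("normalize")
    # Tokenization is not described as optional, so it always runs.
    steps.append("tokenize")
    if args.truecase:
        steps.append("truecase")
    if args.bpe:
        steps.append("bpe")
    return steps
```

With no flags, `enabled_steps` yields clean, normalize, tokenize, truecase and bpe. Passing `--lowercase --no-bpe` adds lowercasing at the front and drops the final step.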
The result should be plain text encoded in UTF-8, in the same format as the input. The number of lines MUST MATCH across files: the text must still be PARALLEL after processing. The program should print what it is currently doing to the console, run without problems on Ubuntu Linux, and be easy to install. Ideally it should come with an installation script. You will also need to write short documentation and a user manual with simple examples.
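The line-count requirement can be checked mechanically after every run. The sketch below is one possible way to do it. The helper names `count_lines` and `assert_parallel` are made up for this example.

```python
def count_lines(path):
    """Count the lines of a UTF-8 text file without loading it into memory."""
    with open(path, "r", encoding="utf-8") as handle:
        return sum(1 for _ in handle)


def assert_parallel(paths):
    """Raise ValueError unless all files have the same number of lines.

    Returns the shared line count (0 if no paths are given).
    """
    counts = {path: count_lines(path) for path in paths}
    if len(set(counts.values())) > 1:
        raise ValueError(f"Line counts differ, corpus is no longer parallel: {counts}")
    return next(iter(counts.values()), 0)
```

Running this check on the inputs before processing and on the outputs afterwards would catch any step that drops or splits a line in only one language. If a cleaning step removes a sentence pair, it has to remove the same line number from every file.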
We have already discussed the project in the chat, so this text is only here to reach the minimum of 100 characters required for a bid.