You will use Python to implement a Hidden Markov Model (HMM) based
Part-of-Speech (POS) tagger for the biomedical domain. The training and test sets, both obtained
from the Genia Corpus, are available.
The training set contains 13677 sentences, and the test set contains 6869 sentences. The training
and test set files contain one token/POS pair per line, and a line of ten equal signs (==========)
separates consecutive sentences.
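
The file format described above could be read along the following lines. This is only a sketch: it assumes that the token and its tag are joined by a "/" character on each line (the exact separator is not stated here), and the function name read_tagged_sentences is a hypothetical helper, not part of the assignment.

```python
from typing import Iterable, List, Tuple

SENTENCE_BREAK = "=" * 10  # the ten-equal-sign separator line


def read_tagged_sentences(
    lines: Iterable[str], sep: str = "/"
) -> List[List[Tuple[str, str]]]:
    """Group token/POS lines into sentences.

    Assumption: token and tag are joined by `sep`. We split on the LAST
    occurrence so that tokens containing the separator stay intact.
    """
    sentences: List[List[Tuple[str, str]]] = []
    current: List[Tuple[str, str]] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line == SENTENCE_BREAK:
            if current:
                sentences.append(current)
                current = []
            continue
        token, _, tag = line.rpartition(sep)
        if not token:
            continue  # skip malformed lines that have no token part
        current.append((token, tag))
    if current:  # the file may not end with a separator line
        sentences.append(current)
    return sentences
```

Since the function accepts any iterable of lines, an open file object can be passed to it directly.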
You should estimate the parameters of your HMM (i.e., the tag transition and word
likelihood probabilities) from the training set. You should implement the Viterbi algorithm for
decoding (tagging the test set).
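
A minimal sketch of the two steps above (counting-based parameter estimation, then Viterbi decoding) is given below. Several choices are assumptions that the assignment does not prescribe: the pseudo-tags "<s>" and "</s>" for sentence boundaries, add-one smoothing to handle unseen words and transitions, and log probabilities to avoid underflow. All function names are hypothetical.

```python
import math
from collections import Counter, defaultdict

START, END = "<s>", "</s>"  # assumed boundary pseudo-tags


def train_hmm(sentences):
    """Count tag bigrams and (tag, word) pairs from tagged sentences."""
    trans = defaultdict(Counter)  # trans[prev_tag][tag] = count
    emit = defaultdict(Counter)   # emit[tag][word] = count
    vocab = set()
    for sent in sentences:
        prev = START
        for word, tag in sent:
            trans[prev][tag] += 1
            emit[tag][word] += 1
            vocab.add(word)
            prev = tag
        trans[prev][END] += 1
    return trans, emit, vocab


def log_prob(counts, key, n_outcomes, k=1.0):
    """Add-k smoothed log probability of `key` under a Counter."""
    return math.log((counts[key] + k) / (sum(counts.values()) + k * n_outcomes))


def viterbi(words, trans, emit, vocab):
    """Return the most probable tag sequence for `words`."""
    if not words:
        return []
    tags = list(emit)
    n_next = len(tags) + 1    # possible next states: every tag plus END
    n_words = len(vocab) + 1  # vocabulary plus one slot for unknown words
    # Each column maps tag -> (best log score ending in tag, backpointer).
    columns = [{
        t: (log_prob(trans[START], t, n_next)
            + log_prob(emit[t], words[0], n_words), None)
        for t in tags
    }]
    for word in words[1:]:
        prev_col, col = columns[-1], {}
        for t in tags:
            best_prev = max(
                tags,
                key=lambda p: prev_col[p][0] + log_prob(trans[p], t, n_next),
            )
            score = (prev_col[best_prev][0]
                     + log_prob(trans[best_prev], t, n_next)
                     + log_prob(emit[t], word, n_words))
            col[t] = (score, best_prev)
        columns.append(col)
    # Include the transition into END when choosing the final tag.
    last = max(
        tags, key=lambda t: columns[-1][t][0] + log_prob(trans[t], END, n_next)
    )
    path = [last]
    for col in reversed(columns[1:]):
        path.append(col[path[-1]][1])  # follow backpointers right to left
    return path[::-1]
```

The sketch recomputes the smoothed probabilities on every call, which keeps it short but is slow on a corpus of this size. Precomputing the log-probability tables once after training would be the natural next step.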
For the second phase of the project, you should implement a program that takes as input the name
of a .txt file containing arbitrary biomedical text. Your program should split the input
file into sentences and then apply the POS tagger that you implemented in the first phase
to each sentence. At the end, your program should output all noun phrases (not only the nouns!)
in the given biomedical text. You should apply rules for the extraction of noun phrases
(for example, DT + ADJ + N constitutes an NP, and so on).
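
One possible way to approach the sentence splitting and the rule-based noun phrase extraction is sketched below. It rests on assumptions: the tags follow Penn Treebank-style names (DT for determiners, JJ* for adjectives, NN* for nouns), the single rule used is "optional determiner, any number of adjectives, one or more nouns", and the sentence splitter is a naive regular expression that will mis-split on abbreviations. The function names are hypothetical, and the input to extract_noun_phrases is a sentence that has already been tagged.

```python
import re


def split_sentences(text):
    """Naive splitter: break after . ! or ? when the next chunk
    starts with an uppercase letter or a digit."""
    parts = re.split(r"(?<=[.!?])\s+(?=[A-Z0-9])", text.strip())
    return [p for p in parts if p]


def tag_class(tag):
    """Collapse a POS tag into a one-letter class for pattern matching."""
    if tag == "DT":
        return "D"
    if tag.startswith("JJ"):
        return "J"
    if tag.startswith("NN"):
        return "N"
    return "O"  # anything else


# Rule: optional determiner, zero or more adjectives, one or more nouns.
NP_PATTERN = re.compile(r"D?J*N+")


def extract_noun_phrases(tagged):
    """Return noun phrases from a list of (word, tag) pairs."""
    classes = "".join(tag_class(tag) for _, tag in tagged)
    # One character per token, so match offsets are token indices.
    return [
        " ".join(word for word, _ in tagged[m.start():m.end()])
        for m in NP_PATTERN.finditer(classes)
    ]


tagged = [
    ("The", "DT"), ("activated", "JJ"), ("T", "NN"), ("cells", "NNS"),
    ("express", "VBP"), ("the", "DT"), ("IL-2", "NN"), ("receptor", "NN"),
    (".", "."),
]
print(extract_noun_phrases(tagged))
# ['The activated T cells', 'the IL-2 receptor']
```

Encoding each tag as one character lets further rules (for example, noun phrases joined by a preposition) be added by extending the pattern rather than by writing new loops.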