I have call transcripts saved as HTML files and need them converted into one of two formats:
1. A table with columns for the part of the call (discussion or Q&A), the executive's name, the executive's full title, a CEO/CFO indicator, and the text of that executive's speech. I only need the CEO and CFO from each transcript, so all other text should be dropped.
2. A text or Word file with a header in the form "Executive name - title - part of call", followed by that executive's text.
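
As a rough illustration of option 1, a sketch along the following lines might work. It rests on several assumptions that would need checking against the real files: each speaker header sits in a `<strong>` tag and reads "Name - Title", the Q&A part starts at a `<strong>` header containing "Question-and-Answer" or "Q&A", and the title text itself says "CEO"/"Chief Executive Officer" or "CFO"/"Chief Financial Officer". The names `TranscriptParser`, `classify_role`, and `extract_ceo_cfo` are hypothetical; only Python's standard library is used.

```python
from html.parser import HTMLParser
import re


class TranscriptParser(HTMLParser):
    """Collects one row per <strong> speaker header, with the text that follows it."""

    def __init__(self):
        super().__init__()
        self.in_strong = False
        self.section = "Discussion"   # assumed to switch to "Q&A" at a Q&A header
        self.rows = []                # dicts with keys: section, header, text
        self._strong_buf = []

    def handle_starttag(self, tag, attrs):
        if tag == "strong":
            self.in_strong = True
            self._strong_buf = []

    def handle_endtag(self, tag):
        if tag == "strong" and self.in_strong:
            self.in_strong = False
            header = " ".join("".join(self._strong_buf).split())
            if not header:
                return
            if re.search(r"question-and-answer|q&a", header, re.I):
                self.section = "Q&A"  # section marker, not a speaker
            else:
                self.rows.append(
                    {"section": self.section, "header": header, "text": ""}
                )

    def handle_data(self, data):
        if self.in_strong:
            self._strong_buf.append(data)
        elif self.rows:
            # Text outside <strong> belongs to the most recent speaker.
            self.rows[-1]["text"] += data


def classify_role(title):
    """Return 'CEO', 'CFO', or None based on the title text (assumed wording)."""
    if re.search(r"\bCEO\b|chief executive officer", title, re.I):
        return "CEO"
    if re.search(r"\bCFO\b|chief financial officer", title, re.I):
        return "CFO"
    return None


def extract_ceo_cfo(html):
    """Return table rows for CEO and CFO speech only; all other speakers are dropped."""
    parser = TranscriptParser()
    parser.feed(html)
    parser.close()
    out = []
    for row in parser.rows:
        name, _, title = row["header"].partition(" - ")
        role = classify_role(title)
        if role is None:
            continue
        out.append({
            "part": row["section"],
            "name": name.strip(),
            "title": title.strip(),
            "role": role,
            "text": " ".join(row["text"].split()),
        })
    return out
```

The returned rows could then be written to a CSV file with `csv.DictWriter`, which opens in Excel. If the real headers contain only the name, with titles listed elsewhere in the file, the title lookup would need a different approach.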
The visual formats of the transcripts are not uniform, but I believe the HTML tags are (e.g. <strong>executive name</strong>). I have RStudio and Anaconda and can run the code myself, but I cannot write it. Ideally I would be able to run a loop over many files and get a txt or doc output file for each transcript, with the same name as the source HTML file.
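
The batch step could look roughly like the sketch below. It is only an outline under assumptions: the function name `convert_folder` and its `convert` argument (any function that takes the HTML as a string and returns the output text) are hypothetical, the files are assumed to end in `.html`, and the output is plain `.txt` rather than a Word document.

```python
from pathlib import Path


def convert_folder(in_dir, out_dir, convert):
    """Run `convert` on every .html file in in_dir; write <same name>.txt to out_dir."""
    in_dir, out_dir = Path(in_dir), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for html_path in sorted(in_dir.glob("*.html")):
        # errors="replace" keeps the loop going if a file has odd encoding.
        html = html_path.read_text(encoding="utf-8", errors="replace")
        out_path = out_dir / (html_path.stem + ".txt")
        out_path.write_text(convert(html), encoding="utf-8")
        written.append(out_path)
    return written
```

For example, `convert_folder("transcripts", "output", my_converter)` would turn `transcripts/call1.html` into `output/call1.txt`, where `my_converter` stands in for whatever extraction function is used.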
A built-in option to remove stop words, numbers, and punctuation would also be helpful. I can write these steps myself, but I'm not sure how comfortable I would be integrating them into the rest of the code.
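
A minimal sketch of such a cleaning step, assuming a small illustrative stop-word list (a real run would likely swap in a fuller list, such as the one shipped with NLTK). The names `STOP_WORDS` and `clean_text` are hypothetical.

```python
import string

# Illustrative stop words only; this is an assumption, not a complete list.
STOP_WORDS = {
    "a", "an", "and", "the", "of", "to", "in", "on", "for",
    "we", "our", "is", "are", "that", "this", "with",
}


def clean_text(text, stop_words=STOP_WORDS):
    """Lowercase the text, then drop stop words, punctuation, and any token with a digit."""
    kept = []
    for raw in text.lower().split():
        token = raw.strip(string.punctuation)  # trims ASCII punctuation at the edges
        if not token or token in stop_words:
            continue
        if any(ch.isdigit() for ch in token):
            continue  # removes "12", "q3", "2019" and similar
        kept.append(token)
    return " ".join(kept)
```

With this sketch, `clean_text("Revenue grew 12% in Q3, and margins improved.")` returns `"revenue grew margins improved"`. Tokens such as "Q3" are removed because they contain a digit; the digit check would need adjusting if those should be kept.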
I've attached one example transcript (HTML file), one sample Excel output, and one sample doc output. The timing on this is ASAP, as I've been spinning my wheels for about a week now.