Parsing HTML files that may not comply with HTML DOM

I have 1TB of htm files.

These files have been collected from the same 30 websites since 2011.

When I'm trying to write a parser for these (python,nlp,beautifulsoup,decruft,etc.) large portions of known text does not appear. Upon review I notice that some pages are putting the actual page content into json strings that live within javascript elements. This makes the parsing process very cumbersome for me. One example of this behavior is the current [login to view URL] index page.

This project will have 1 deliverable. A conversation with me to discuss modern technologies and techniques that can be used to parse these pages, store the results, and make it available for searching.

After our conversation a second project may be opened on freelancer exclusively for the winning bidder to perform further work related to the parsing of HTML based on their suggestions provided from our discussion on this project.

Квалификация: HTML, Javascript, Python, Архитектура ПО, Веб-скрейпинг

Показать больше html parser, html parser online, html parser java, parse html c#, parse html with regex, node html parser, javascript parse html, parse html php, dom html, html submit current time, read dom html source, parse big xml files using dom, java classes parsing files, parsing files java, php parsing files, dom html javascript convert money, parsing files unix, dom html document webbrowser, display dom html document, html menu current

О работодателе:
( 12 отзыв(-а, -ов) ) Troy, United States

ID проекта: #18626886

6 фрилансеров(-а) в среднем готовы выполнить эту работу за $40


This can be done by combining different techniques including regular expressions. I have huge experience with parsing HTML files. Ready to start immediately. Please contact with details if you are interested. Thank you Больше

$30 USD за 1 день
(92 отзывов(-а))

Hello, If there are 30 websites in total I think the best approach would be creating a separated layout for every one of them. So let's say on example [login to view URL] uses JSON format, I would create special layout only for CN Больше

$30 USD за 1 день
(44 отзывов(-а))

Hi, I am a web scrapping expert. I have worked extensively on NLP. What i think you need is combination of both of these in order to parse your data correctly. Let us discuss further details in chat. I will be gla Больше

$100 USD за 1 день
(22 отзывов(-а))

Hello, After reading this job description, I am convinced that I would be a perfect fit for this role. That is because of - I have a firm grasp of front-end dev skills with HTML5/CSS3/Bootstrap/jQuery. - I have n Больше

$25 USD за 1 день
(22 отзывов(-а))

I'm an expert Python developer and I've done web scraping. For that reason I think I'm the best candidate for this work... Have you tried to run the Javascript from the HTML? This is a good solution but could be sl Больше

$30 USD за 1 день
(1 отзыв)

Hi there! My name is Yamila and I teach Python Programming at the University level. I checked the [login to view URL] source and I see what you mean. I don't know about the other 29 websites, I haven't seen them, but for the CNN o Больше

$25 USD за 1 день
(0 отзывов(-а))