Web Data Extraction

Scope:

Develop a system using Apache Nutch, Apache Hadoop and Apache Solr that crawls the given websites on a round-robin basis, 100 pages at a time (configurable), and automatically stores the pages on Hadoop in a folder named after each website.
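The round-robin scheduling and per-site folder naming could look roughly like the sketch below. It reads "100 (configurable)" as a per-site page limit, which is an assumption, and the names site_folder and round_robin_schedule are hypothetical helpers, not part of Nutch or Hadoop.

```python
from collections import deque
from urllib.parse import urlparse


def site_folder(base_dir, site_url):
    """Derive a per-site folder path from the site's host name (assumed convention)."""
    host = urlparse(site_url).netloc.lower()
    return f"{base_dir.rstrip('/')}/{host}"


def round_robin_schedule(pending, pages_per_site=100):
    """Yield (site, url) pairs, taking one URL from each site in turn.

    pending maps a site to its list of URLs. A site leaves the rotation
    once it has no URLs left or pages_per_site URLs have been yielded for it.
    """
    queue = deque((site, deque(urls)) for site, urls in pending.items())
    taken = {site: 0 for site in pending}
    while queue:
        site, urls = queue.popleft()
        if not urls or taken[site] >= pages_per_site:
            continue  # exhausted or limit reached: drop the site from rotation
        yield site, urls.popleft()
        taken[site] += 1
        queue.append((site, urls))
```

In a real deployment the per-site limit would more likely be enforced through the crawler's own configuration; the sketch only illustrates the intended ordering.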

Some websites require authentication, i.e. a user ID and password. The system should therefore be able to supply the user ID and password dynamically at runtime by reading them from a text file or an XML configuration file. It should be able to store multiple user credentials and use them on a round-robin basis.
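A minimal sketch of round-robin credential selection from an XML file is shown below. The XML layout (credentials/site/user elements with host, id and password attributes) is invented for illustration and is not a Nutch format; the helper names are likewise hypothetical.

```python
import itertools
import xml.etree.ElementTree as ET

# Hypothetical configuration layout, assumed for this example only.
SAMPLE_CONFIG = """
<credentials>
  <site host="example.com">
    <user id="alice" password="secret1"/>
    <user id="bob" password="secret2"/>
  </site>
</credentials>
"""


def load_credential_cycles(xml_text):
    """Parse the XML and return {host: endless round-robin iterator of (id, password)}."""
    root = ET.fromstring(xml_text.strip())
    cycles = {}
    for site in root.findall("site"):
        users = [(u.get("id"), u.get("password")) for u in site.findall("user")]
        if users:
            cycles[site.get("host")] = itertools.cycle(users)
    return cycles


def next_credentials(cycles, host):
    """Return the next (id, password) pair for host, or None if none are configured."""
    cycle = cycles.get(host)
    return next(cycle) if cycle is not None else None
```

Each call to next_credentials advances the rotation for that host only, so credentials for different sites rotate independently.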

Crawled pages will be stored in the respective site folders on Apache Hadoop.

Crawled page contents and metadata will be stored and indexed in Solr with the fields listed under Solr Fields.

All documents, such as PDF, video, audio, DOC, DOCX, JPEG and PNG files, will be stored in folders with clear identification, i.e. with their URL, so that the web page can be reconstructed from the content.
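One way to keep stored files traceable to their URL is a deterministic path plus a manifest entry that records the original URL. The layout below (host / file extension / hash-prefixed file name) is an assumption made for illustration, and storage_path and manifest_entry are hypothetical names.

```python
import hashlib
import posixpath
from urllib.parse import urlparse


def storage_path(base_dir, url):
    """Map a URL to base_dir/<host>/<extension>/<url-hash>_<file name>."""
    parsed = urlparse(url)
    host = parsed.netloc.lower()
    name = posixpath.basename(parsed.path) or "index.html"
    ext = posixpath.splitext(name)[1].lstrip(".").lower() or "other"
    # A short hash of the full URL keeps same-named files from colliding.
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()[:12]
    return posixpath.join(base_dir, host, ext, f"{digest}_{name}")


def manifest_entry(base_dir, url):
    """Record the URL next to its storage path so the page can be reassembled later."""
    return {"url": url, "path": storage_path(base_dir, url)}
```

The manifest entries could be written alongside the files or indexed with the page, so that every stored asset can be looked up by the URL it came from.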

The crawl will be a focused crawl: the metadata is extracted first and passed to an API, which either passes or fails the page. If the page passes, its whole content is extracted and processed further. The API will be provided as part of the project.
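The metadata-first gate can be sketched as follows. Since the API itself is to be provided later, it is represented here by a hypothetical accept callable that takes the metadata dictionary and returns True or False; MetaExtractor and process_page are illustrative names as well.

```python
from html.parser import HTMLParser


class MetaExtractor(HTMLParser):
    """Collect the keywords and description <meta> tags from an HTML page."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        name = (attrs.get("name") or "").lower()
        if name in ("keywords", "description"):
            self.meta[name] = attrs.get("content") or ""


def extract_metadata(html):
    parser = MetaExtractor()
    parser.feed(html)
    return parser.meta


def process_page(html, accept):
    """Return the page for further processing only if accept(metadata) is true."""
    meta = extract_metadata(html)
    if not accept(meta):
        return None  # failed the gate: the page is discarded
    return {"metadata": meta, "html": html}
```

For example, process_page(html, lambda meta: "solr" in meta.get("keywords", "")) keeps only pages whose keywords mention Solr.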

Solr Fields:

• Site

• Title

• Host

• Segment

• Boost

• Digest

• Time Stamp

• URL

• Site Content (Text)

• Site Content (HTML)

• Metadata (Keywords, Content)

• Metadata (Description, Content)

• [url removed, login to view] (URLs)
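As an illustration, a document carrying the named fields could be assembled like this before being sent to Solr. The field names (site, tstamp, meta_keywords and so on) are assumptions and would have to match the actual Solr schema; treating Site and Host as the same host name and using MD5 for the digest are also assumptions.

```python
import hashlib
from datetime import datetime, timezone
from urllib.parse import urlparse


def build_solr_doc(url, title, text, html, meta, segment, boost=1.0, fetched_at=None):
    """Build a dictionary with one entry per listed field (hypothetical field names).

    fetched_at is expected to be a timezone-aware datetime; it defaults to now (UTC).
    """
    fetched_at = (fetched_at or datetime.now(timezone.utc)).astimezone(timezone.utc)
    host = urlparse(url).netloc.lower()
    return {
        "site": host,  # assumed identical to host in this sketch
        "title": title,
        "host": host,
        "segment": segment,
        "boost": boost,
        "digest": hashlib.md5(html.encode("utf-8")).hexdigest(),
        "tstamp": fetched_at.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "url": url,
        "content_text": text,
        "content_html": html,
        "meta_keywords": meta.get("keywords", ""),
        "meta_description": meta.get("description", ""),
    }
```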

Typical Steps:

1. The first step is to load the URL State database with an initial set of URLs. These can be a broad set of top-level domains such as the 1.7 million web sites with the highest US-based traffic, or the results from selective searches against another index, or manually selected URLs that point to specific, high quality pages.

2. Once the URL State database has been loaded with some initial URLs, the first loop in the focused crawl can begin. The first step in each loop is to extract all of the unprocessed URLs, and sort them by their link score.

3. Next comes one of the two critical steps in the workflow. A decision is made about how many of the top-scoring URLs to process in this loop.

4. Once the set of accepted URLs has been created, the standard fetch process begins. This includes all of the usual steps required for polite & efficient fetching, such as [url removed, login to view] processing. Pages that are successfully fetched can then be parsed.

5. Typically fetched pages are also saved into the Fetched Pages database.

6. Whether a page is crawled is decided by the given object. The metadata is passed to the object; if the object returns true, the page is crawled, otherwise the page is discarded.

7. Page rank computation: calculate the importance of the page using the algorithm provided by Nutch/Solr.

8. Once the page has been scored, each outlink found in the parse is extracted.

9. The score for the page is divided among all of the outlinks.

10. Finally, the URL State database is updated with the results of fetch attempts (succeeded, failed), all newly discovered URLs are added, and any existing URLs get their link score increased by all matching outlinks that were extracted during this loop.
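The loop described in these steps can be sketched in a few lines. This is a simplified in-memory model, not how Nutch implements it: the URL State database is a plain dictionary, and fetch, accept and score_page are hypothetical callables standing in for the fetcher, the provided decision object and the scoring algorithm.

```python
from dataclasses import dataclass


@dataclass
class UrlState:
    score: float = 0.0
    status: str = "unfetched"  # "unfetched", "fetched" or "failed"


def select_batch(db, top_n):
    """Steps 2-3: take the unprocessed URLs and keep the top_n by link score."""
    pending = [(url, state) for url, state in db.items() if state.status == "unfetched"]
    pending.sort(key=lambda item: item[1].score, reverse=True)
    return [url for url, _ in pending[:top_n]]


def crawl_loop(db, fetch, accept, score_page, top_n):
    """Run one loop: fetch, gate on metadata, score, share the score among outlinks."""
    for url in select_batch(db, top_n):
        page = fetch(url)  # None on failure, else {"meta": ..., "outlinks": [...]}
        if page is None:
            db[url].status = "failed"
            continue
        db[url].status = "fetched"
        if not accept(page["meta"]):
            continue  # step 6: page discarded
        outlinks = page["outlinks"]
        if not outlinks:
            continue
        share = score_page(page) / len(outlinks)  # step 9: divide the score
        for link in outlinks:
            # Step 10: add new URLs, increase the score of existing ones.
            db.setdefault(link, UrlState()).score += share
```

Calling crawl_loop repeatedly on the same dictionary reproduces the iterative behaviour: URLs discovered in one loop become candidates in the next.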

Part II.

Classification of extracted pages

1. Run the pages through the classification API.

2. Depending on the classification returned, store the page in the corresponding folder along with its relevance score.
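A rough sketch of the routing step follows. The classification API is represented by a hypothetical classify callable that returns a (label, relevance) pair; the folder layout and the JSON sidecar holding the score are assumptions for illustration.

```python
import json
import posixpath


def route_page(base_dir, page_id, text, classify):
    """Pick the target folder from the returned class and keep the score beside the page."""
    label, relevance = classify(text)
    # Normalise the label so it is safe to use as a folder name.
    safe_label = "".join(c if c.isalnum() or c in "-_" else "_" for c in label.lower())
    folder = posixpath.join(base_dir, safe_label)
    return {
        "path": posixpath.join(folder, f"{page_id}.html"),
        "sidecar": json.dumps({"label": label, "relevance": relevance}, sort_keys=True),
    }
```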



Tools and Techniques:

Apache Nutch, Solr, Apache Hadoop

Local system

Test Case:

1. Check the crawl data and XML file in the respective folders.

2. Search for the query parameter in the XML and text files.

Skills: Web Scraping


About the employer:
( 2 reviews ) Mumbai, India

Project ID: #5168918

3 freelancers are ready to do this job for an average of $166


Dear Client, I can help with your project. We already have experience working on similar projects. Please see below to get an idea of our experience: Amazon/Ebay Bots: [login to view URL]

$206 USD in 5 days
(32 reviews)

A proposal has not yet been provided

$144 USD in 3 days
(2 reviews)

I am a Data Entry, MS Word and MS Excel expert. I am very professional in this work, and I am sure that you can't find a better person for this job than me, so I am ready to work on your project with low rate and high ...

$147 USD in 3 days
(1 review)