Search Bot Crawler - Perl

Hi, we've got more work than we can handle at the moment and are therefore considering outsourcing this project. Our company requires a crawler to index thousands of external websites quickly and efficiently. We'd prefer it coded in Perl, although C++ or PHP proposals may be considered. Indexed information must be retrievable from a MySQL or PostgreSQL database in under a second.

Site visitors will be able to submit their website for spidering. Once approved by us, the link is then activated. The spider will then need to go to this link and spider that page and every other page on that site. It must not go outside of that website's domain, though.

We have run tests with scripts such as PHPDig, but with a database of just 4,000,000 entries it starts to get clogged up and rarely finishes its spidering. Your script must therefore be bug free, error correcting and use the lowest possible processing power.

The script must index all HTML, PHP, ASP, Perl and ColdFusion page extensions and ignore image links. A big plus would be the ability to index PDF files. The script must also ignore pages that return HTTP error codes and follow HTTP redirects.

The database structure must be optimised for pulling data out. You must be able to provide the SQL commands required to pull data out of the system, including boolean search queries. The ability to easily change the search engine's ranking algorithm would also be a plus.

We look forward to receiving your quotes.
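To make the link-following rules above concrete for bidders, here is a minimal sketch in Python (one of the skills listed for this project; the same logic ports directly to Perl). It only shows how links found on a page might be filtered: stay on the submitted site's host, keep the listed page types plus PDF, and drop image links. The names `PAGE_EXTENSIONS`, `LinkCollector`, `is_crawlable` and `links_to_follow` are hypothetical, the exact file suffixes are our assumption, and the host comparison is deliberately strict (`www.example.com` and `example.com` count as different sites), which a real implementation may want to relax.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag, urlparse
import posixpath

# Assumed suffixes for the page types named in the posting, plus PDF.
# The empty suffix covers URLs such as "/about" or "/products/".
PAGE_EXTENSIONS = {"", ".html", ".htm", ".php", ".asp", ".pl", ".cfm", ".pdf"}


class LinkCollector(HTMLParser):
    """Collects absolute URLs from <a href="..."> tags on one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links and drop any "#fragment" part.
                absolute, _fragment = urldefrag(urljoin(self.base_url, value))
                self.links.append(absolute)


def is_crawlable(url, site_host):
    """True if the URL is on the same host and has an allowed page suffix."""
    parts = urlparse(url)
    if parts.scheme not in ("http", "https"):
        return False
    if (parts.hostname or "").lower() != site_host.lower():
        return False
    extension = posixpath.splitext(parts.path)[1].lower()
    # Images (.png, .jpg, ...) are rejected because they are not whitelisted.
    return extension in PAGE_EXTENSIONS


def links_to_follow(html, page_url):
    """Returns the de-duplicated, in-order list of links worth spidering."""
    site_host = urlparse(page_url).hostname or ""
    collector = LinkCollector(page_url)
    collector.feed(html)
    seen = set()
    result = []
    for link in collector.links:
        if link not in seen and is_crawlable(link, site_host):
            seen.add(link)
            result.append(link)
    return result
```

Using a whitelist of suffixes rather than a blacklist of image types means unknown binary formats are skipped by default, which should help keep processing low. Fetching, HTTP error and redirect handling, and database storage are left out of this sketch.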

Skills: MySQL, Perl, PHP, Python, Web Scraping

Project ID: #16875093