In progress

Web scraping - Python Script

Job description

I need a Python script to scrape the URL shown below. The scraped data for each request should be saved to a JSON file. Scrapy or headless Selenium is encouraged. *If you can intercept the response, that is also encouraged.

The website has a strong anti-bot policy, so it is necessary to bypass detection. Include user-agent rotation, proxy rotation, cookies, session management, or any other necessary measures. The pages take some time to load, so the scraper must wait until each page finishes rendering.
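For illustration, a minimal sketch of the rendering wait and rotation side, assuming headless Chrome via Selenium; the user-agent list, proxy pool, and table selector are placeholders to be adapted to the actual site:

```python
# Minimal sketch: headless Selenium with a rotating user-agent/proxy,
# waiting for the results table to render. USER_AGENTS, PROXIES, and the
# CSS selector are placeholders, not taken from the target site.
import random

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0 Safari/537.36",
    # ...extend with more real UA strings
]
PROXIES = []  # e.g. "http://host:port"; empty list = direct connection

def make_driver():
    opts = Options()
    opts.add_argument("--headless=new")
    opts.add_argument(f"user-agent={random.choice(USER_AGENTS)}")
    if PROXIES:
        opts.add_argument(f"--proxy-server={random.choice(PROXIES)}")
    return webdriver.Chrome(options=opts)

def fetch(url, timeout=30):
    driver = make_driver()
    try:
        driver.get(url)
        # Block until at least one table row exists, i.e. the page rendered.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, "table tr"))
        )
        return driver.page_source
    finally:
        driver.quit()
```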

The .aspx URL has parameters that make it easy to structure the scrape as completely independent requests. The year {y} and page {pg} parameters are self-explanatory; the {m} and {type} parameters can be inspected by clicking the entries in the left-hand table titled “Modalidad de Compra”.

[login to view URL]
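As a sketch of how the independent requests could be enumerated (the base URL below is a placeholder, since the real endpoint is behind the link above):

```python
# Sketch: enumerate every (y, m, type, pg) combination as an independent
# request. BASE_URL is a placeholder; the real .aspx endpoint is behind
# the "[login to view URL]" link.
from itertools import product

BASE_URL = "https://example.gob.gt/results.aspx"  # placeholder

def build_url(y, m, type_, pg):
    return f"{BASE_URL}?y={y}&m={m}&type={type_}&pg={pg}"

def all_requests(years, m_values, types, pages):
    for y, m, type_, pg in product(years, m_values, types, pages):
        params = {"year": str(y), "m_query": str(m),
                  "type_query": str(type_), "page": str(pg)}
        yield build_url(y, m, type_, pg), params
```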

Review the attached image to find the variables. I need to get each row from the table (20 rows per page); a parsing sketch follows the field list below.

a.1 - variableName: npg; type: string

a.2 - variableName: npg_link; type: string (href)

b - variableName: date; type: string (do not convert to datetime)

c - variableName: status; type: string

d - variableName: guatecompras; type: string (this variable could be ignored)

e - variableName: entity; type: string

f - variableName: modality; type: string

g - variableName: description; type: string

h - variableName: nit; type: string

i.1 - variableName: provider; type: string

i.2 - variableName: provider_link; type: string (href)

j - variableName: amount; type: float (you can leave it as a string, the most important thing is to get the data)
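A possible row-to-record mapping, assuming one table cell per field in the order listed above; the column indices and selectors are guesses to be checked against the real markup, and date and amount are kept as strings per points b and j:

```python
# Sketch: map each table row to the field names above. Column order and
# selectors are assumptions; verify against the real markup. "date" and
# "amount" are deliberately left as strings (points b and j).
from bs4 import BeautifulSoup

def parse_rows(html):
    soup = BeautifulSoup(html, "html.parser")
    for tr in soup.select("table tr"):
        cells = tr.find_all("td")
        if len(cells) < 10:  # skip header/spacer rows
            continue
        npg_a = cells[0].find("a")
        prov_a = cells[8].find("a")
        yield {
            "npg": cells[0].get_text(strip=True),
            "npg_link": npg_a["href"] if npg_a else None,
            "date": cells[1].get_text(strip=True),
            "status": cells[2].get_text(strip=True),
            "guatecompras": cells[3].get_text(strip=True),
            "entity": cells[4].get_text(strip=True),
            "modality": cells[5].get_text(strip=True),
            "description": cells[6].get_text(strip=True),
            "nit": cells[7].get_text(strip=True),
            "provider": cells[8].get_text(strip=True),
            "provider_link": prov_a["href"] if prov_a else None,
            "amount": cells[9].get_text(strip=True),
        }
```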

Create four new variables from the URL to log the parameters used to obtain each record, plus a timestamp (a small sketch follows this list). This will reduce redundant requests.

k - variableName: year; type: string

m - variableName: type_query; type: string

n - variableName: page; type: string

o - variableName: m_query; type: string

p - variableName: timestamp; type: string
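One way to stamp each record, reusing the params dict produced next to each URL above; the ISO timestamp format is my assumption:

```python
# Sketch: attach the request parameters and a timestamp to every record
# so duplicate parameter combinations can be detected and skipped.
from datetime import datetime, timezone

def annotate(record, params):
    # params = {"year": ..., "type_query": ..., "page": ..., "m_query": ...}
    record.update(params)
    record["timestamp"] = datetime.now(timezone.utc).isoformat()
    return record
```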

If a request is unsuccessful, log the URL with its parameters in a log file (with a timestamp). For some reason there are some pages that do not work, and we don't want to keep sending requests to a broken link.
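A minimal failure log along those lines, assuming Python's standard logging module; the log file name is a placeholder:

```python
# Sketch: one timestamped line per failed request, so broken parameter
# combinations are recorded instead of being retried endlessly.
import logging

logging.basicConfig(
    filename="failed_requests.log",  # placeholder name
    format="%(asctime)s %(message)s",
    level=logging.INFO,
)

def log_failure(url, params, reason):
    logging.info("FAILED %s params=%s reason=%s", url, params, reason)
```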

Send me the JSON file or files with at least 200 results (from independent requests) so I can verify that the scraper works. Log the errors and let me know which issues you found and how you solved them.

Please contact me if you're interested or have questions.

***Bonus job: If the scraper is successful and your code is neat, I will want you to scrape the entire URL list and store the data in a PostgreSQL database. That is a separate job and I will pay for it separately.***
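For that bonus stage, loading the JSON output into PostgreSQL could look roughly like this; the driver (psycopg2), DSN, table name, and all-TEXT schema are my assumptions:

```python
# Sketch of the bonus step: bulk-load scraped JSON records into PostgreSQL.
# psycopg2, the DSN, the table name, and all-TEXT columns are assumptions.
import json
import psycopg2

COLS = ("npg", "npg_link", "date", "status", "guatecompras", "entity",
        "modality", "description", "nit", "provider", "provider_link",
        "amount", "year", "type_query", "page", "m_query", "timestamp")

def load(json_path, dsn="dbname=scrape user=postgres"):  # placeholder DSN
    with open(json_path, encoding="utf-8") as f:
        records = json.load(f)
    conn = psycopg2.connect(dsn)
    try:
        with conn, conn.cursor() as cur:  # commits on clean exit
            cur.execute("CREATE TABLE IF NOT EXISTS contracts (%s)"
                        % ", ".join(f'"{c}" TEXT' for c in COLS))
            placeholders = ", ".join(["%s"] * len(COLS))
            for r in records:
                cur.execute(f"INSERT INTO contracts VALUES ({placeholders})",
                            tuple(r.get(c) for c in COLS))
    finally:
        conn.close()
```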

Skills: Python, Web Scraping, Scrapy, Data Scraping, Selenium

About the client:
(2 reviews) Guatemala, Guatemala

Project ID: #34387874