This project required complex scraping of multiple multi-part, multi-language, multi-format PDFs into CSV. Initially, some documents had no headings, which required some innovation in identifying column text. Text in headers, footers and margins also had to be taken into account. Some data processing was required as well, since the presence of values in particular sections of a PDF would influence the output. The scraper scripts had to be modified continually as the documents evolved. Over 2 years, more than 10 versions of the scripts have been released.
We are an IT and data science company with 20+ years of combined experience in the IT field, specifically Linux and Windows systems administration, and 5+ years in the data science field, where we have expertise in statistical analysis, predictive analysis, machine learning and big data. We also have expertise with all major cloud platforms, i.e. Amazon Web Services, Google Cloud Platform and Microsoft Azure.

Contact us for:
+ Descriptive and predictive analysis of data, machine learning, big data
+ R programming and Shiny interfaces/dashboards
+ Systems integration, web scraping, API creation
+ Linux systems administration