PDF Data TILT Information

Purpose

Data scientists and statisticians must be able to take data in its natural, messy form and transform it into tidy data, documenting each transformation and considering its impact on inference and interpretation. Many organizations release data only as PDF, without a more standard machine-readable format (CSV, XLSX, XML, etc.), so it is sometimes necessary to extract information from the PDF files themselves. This lab gives you practice with the skills needed to extract data from PDFs, so that if you are ever in the unfortunate situation of having to do so, you have some familiarity with the tools and the process.

Skills

This assignment will help you practice the following skills, which are important for accessing and working with PDF files:

  • Identify the type of PDF file
  • Develop a strategy for accessing the data, determining whether OCR, text processing, or other tools are needed to get the data out of the PDF file
  • Extract the text data from the PDF file
  • Clean and format the extracted text to produce tabular data for analysis
    • Data cleaning and wrangling skills
    • Text processing and regular expression skills
  • Implement quality control checks to catch and correct common errors that occur during the extraction process
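
As a rough illustration of the cleaning and quality control skills listed above, the sketch below uses only the Python standard library. It assumes the text has already been pulled out of the PDF by some extraction tool. The sample text, the column names, and the helpers `parse_rows` and `check_rows` are all hypothetical, made up for illustration rather than taken from the lab's data.

```python
import re

# Hypothetical text as it might come out of a PDF table: columns separated
# by runs of spaces, plus a header line, a page footer, and an OCR-style
# error (the letter "O" where a zero belongs).
RAW_TEXT = """\
County        Year   Cases
Adams         2021   1,204
Boone         2021     87O
Clay          2021      35
Page 1 of 3
"""

# A data row: a name, at least two spaces, a four-digit year, a count token.
ROW_PATTERN = re.compile(
    r"^(?P<county>[A-Za-z .'-]+?)\s{2,}(?P<year>\d{4})\s+(?P<cases>\S+)$"
)


def parse_rows(text):
    """Return a list of dicts for the lines that look like data rows."""
    rows = []
    for line in text.splitlines():
        match = ROW_PATTERN.match(line.strip())
        if match:
            rows.append(match.groupdict())
    return rows


def check_rows(rows):
    """Split rows into clean ones (numbers converted) and flagged ones."""
    clean, flagged = [], []
    for row in rows:
        digits = row["cases"].replace(",", "")
        if digits.isdecimal():
            clean.append({**row, "year": int(row["year"]), "cases": int(digits)})
        else:
            flagged.append(row)  # needs a manual look against the PDF
    return clean, flagged


rows = parse_rows(RAW_TEXT)
clean, flagged = check_rows(rows)
print(f"{len(clean)} clean rows, {len(flagged)} flagged for review")
```

Flagging suspicious rows instead of silently repairing them keeps a record of where the extraction went wrong, which makes it easier to document the transformations and compare the result against the original PDF.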

Knowledge

This assignment will help you become familiar with important knowledge in this discipline:

  • PDF file structure and formats (which matter for making accessible documents as well as for OCR and data extraction)
  • Use of new libraries to accomplish a task (tesseract, tabulapdf, camelot, etc.)
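
One small, concrete piece of PDF file structure is that a file starts with a `%PDF-` header carrying the version number and ends with an `%%EOF` marker. The sketch below checks for both using only the Python standard library. The helper names `pdf_version` and `has_eof_marker` and the sample bytes are hypothetical, and the check says nothing about whether the pages hold real text or scanned images; it is only a first sanity check before reaching for an extraction library.

```python
import re

# The header looks like "%PDF-1.7" and should appear at the start of the file.
HEADER_PATTERN = re.compile(rb"%PDF-(\d+)\.(\d+)")


def pdf_version(data):
    """Return (major, minor) from the PDF header, or None if it is missing."""
    match = HEADER_PATTERN.search(data[:1024])
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def has_eof_marker(data):
    """Check for the %%EOF marker near the end of the file."""
    return b"%%EOF" in data[-1024:]


# Hypothetical, heavily truncated stand-in for the bytes of a real file.
# In practice you might use: data = open("report.pdf", "rb").read()
sample = b"%PDF-1.7\n% ...objects, xref table, trailer...\n%%EOF\n"

print(pdf_version(sample), has_eof_marker(sample))
```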

Success Criteria

General Criteria

Task-Specific Criteria

  1. Warming Up
    1. PDF Format
    2. PDF Structure
    3. Plan of Attack
    4. Acquiring Metadata
    5. Anomalies and Strategy Adjustments
  2. Extracting the Text
    1. Identify Relevant Pages
    2. Read in Relevant Text
    3. Plan Your Approach
    4. Would Coordinates Help?
    5. Explore Package Documentation
    6. A Classical Problem
    7. Quality Control