Stat 351. Statistical Computing II - Data Management and Visualization

Course Description

Computational skills for the management, visualization, and analysis of large and complex data, as required for modern statistics. Covers a wide range of topics in data analytics, including harvesting data from websites and from common data structures, setting up and working with databases, and designing interactive data displays.

Learning Objectives

  • Access and leverage data from formats and sources commonly used outside of statistics (HTML, JSON, XML, PDF, APIs) and transform this data into formats used for statistical analysis. You will be able to
    • Scrape data from the web and assemble it into a “tidy” format for visualization and analysis.
    • Read in structured data from record-based formats (XML, JSON) and transform this data into a table-based format.
    • Use optical character recognition and other tools to extract data from a PDF file systematically.
    • Use an API to request data from an online service.
    • Implement data cleaning and quality control measures to ensure that data is read in correctly.
  • Develop skills for visualization and communication of complex data using interactive graphics. You will be able to
    • Determine when an interactive chart is preferable to a static chart.
    • Create an interactive chart using web-based tools such as Plotly, Observable.js, or Shiny.
    • Integrate your interactive chart into a report or web page, along with supporting text that describes the chart and its important findings.
  • Understand and leverage data management tools for storing and manipulating data, including
    • Identifying situations where an external database is preferable to working with data in memory.
    • Accessing data stored in an external SQL database or in Parquet or Arrow format.
    • Discussing the tradeoffs between different tools for data management and different approaches to data storage.
    • Designing an analysis strategy for data too large to fit in computer memory, selecting from approaches such as sampling and split-apply-combine.