bad_data

Notes while reading Bad Data Handbook


Chapter 1 - Setting the Pace: What is Bad Data?

The way the book is organized:

  1. Guidance for Grubby, Hands-on Work
    • You can’t assume that a new dataset is clean and ready for analysis. Ch. 2 offers several techniques to take the data for a test drive.
    • Ch. 3 (Data Intended for Human Consumption, Not Machine Consumption): Some ways to help you extract data (from spreadsheets) into something more usable.
    • Ch. 4 is about character encoding problems and how to handle them.
    • Ch. 5 walks you through everything that can go wrong in a web-scraping effort.
  2. Data That Does the Unexpected
    • Ch. 6: Using Natural Language Processing (NLP) to detect liars and the confused.
    • Ch. 9: “When Data and Reality Don’t Match”
  3. Approach
    • Advice to data scientists from a software developer’s perspective (Ch. 8). :bulb: Note from Sergio: why would you name it Blood, Sweat, and Urine?
    • Ch. 7: Is there such a thing as truly bad data?
    • Ch. 10: How you collect your data determines what will hurt you (bias and error).
    • Ch. 11: How dirty data will give your classical statistics training a harsh reality check.
  4. Data Storage and Infrastructure
    • Ch. 13: How you store your data weighs heavily in how you can analyze it. Spotting graph data in a relational database.
    • Ch. 14: Dissecting assumptions on cloud computing’s scalability and flexibility. :bulb: Note from Sergio: this book is from 2013, so I expect this to be outdated.
    • Ch. 12: When to stick to files instead of databases.
  5. The Business of Data
    • Ch. 16: How to out-source machine-learning.
    • Ch. 15: Several worst practices to avoid when it comes to corporate bureaucracy policy.
  6. Data Policy
    • Ch. 17: Sure you know what methods you used, but do you truly understand how those final figures came to be? Food for thought for your data processing pipelines.
    • Ch. 18: Looks to the future of social media, and thinks through a much-needed recall feature. :bulb: Note from Sergio: Again, this is from 2013…
    • Ch. 19: How to assess your data’s quality, and how to build a structure around a data quality effort.

:bulb: Note from Sergio: this book is v white.
