You can’t assume that a new dataset is clean and ready for analysis. Ch. 2 offers several techniques to take the data for a test drive.
Ch. 3 (Data Intended for Human Consumption, Not Machine Consumption): Ways to extract data from human-oriented formats, such as spreadsheets, into something more usable.
Ch. 4 is about character encoding problems and how to handle them.
Ch. 5 walks you through everything that can go wrong in a web-scraping effort.
Data That Does the Unexpected
Ch. 6: Using natural language processing (NLP) to detect liars and the confused.
Ch. 9: “When Data and Reality Don’t Match”
Approach
Ch. 8: Advice to data scientists from a software developer’s perspective. Note from Sergio: why would you name it Blood, Sweat, and Urine?
Ch. 7: Is there such a thing as truly bad data?
Ch. 10: How you collect your data determines what will hurt you (bias and error).
Ch. 11: How dirty data will give your classical statistics training a harsh reality check.
Data Storage and Infrastructure
Ch. 13: How you store your data weighs heavily in how you can analyze it. Spotting graph data in a relational database.
Ch. 14: Dissecting assumptions on cloud computing’s scalability and flexibility. Note from Sergio: this book is from 2013, so I expect this to be outdated.
Ch. 12: When to stick to files instead of databases.
The Business of Data
Ch. 16: How to outsource machine learning.
Ch. 15: Several worst practices to avoid when it comes to corporate bureaucracy and policy.
Data Policy
Ch. 17: Sure, you know which methods you used, but do you truly understand how those final figures came to be? Food for thought for your data processing pipelines.
Ch. 18: Looks to the future of social media and thinks through a much-needed recall feature. Note from Sergio: again, this is from 2013.
Ch. 19: How to assess your data’s quality, and how to build a structure around a data quality effort.