Issue 32: This Week in Data (Mar 31, 2021)
(1) Ted Chiang interview on the future of AI. (2) The Pudding visualizes skin-tone bias in beauty products. (3) Benchmark datasets such as ImageNet found to have labeling errors for over 5% of items.
Hi all,
Hope you’ve had an amazing week. No nonsense today. Please subscribe if you’d like this in your inbox each week!
Data Stories and Journalism
Stories and journalism tackle the ethics of data
[ Future ] AI in Science Fiction
"The Trolley Solution" explores a world where AI replaces human teachers. It is part of Slate's commissioned series of short stories, each paired with a pre-written analysis of the story. The pair of articles is a great thought exercise on the current state of the world and how imagined AI futures can be closer than they seem. It reminds me of Asimov's 1954 novel "The Caves of Steel", where humans are suspicious of the robot detectives.
Ted Chiang sits down for an interview with Ezra Klein, where they discuss the implications of AI in the future (among other things). For those of you who love Ted Chiang, this will be a treat.
“So then as for the third question of, should we do so, should we make machines that are conscious and that are moral agents, to that, my answer is, no, we should not. Because long before we get to the point where a machine is a moral agent, we will have machines that are capable of suffering. Suffering precedes moral agency in sort of the developmental ladder.” (Ted Chiang in interview with Ezra Klein, March 30, 2021)
For those of you who don't know who Ted Chiang is, he's a science fiction short story writer (one of my favorites). He gained wider public recognition when one of his stories was adapted into the movie Arrival. His two published short story collections are "Stories of Your Life and Others" and "Exhalation".
[ Present ] Data Journalism and Journalism in Data
The Data Journalism Handbook has been republished, updated from its original 2012 edition. The 400+ page book is a collection of articles and how-tos written by dozens of journalists across the world. It is free for digital download (link) (pdf - whole book).
Khari Johnson, a longtime contributor to VentureBeat, is going to Wired Magazine. VB has some of the best data reporting out there, and I'm excited to see this talent spread across more magazines.
The Pudding puts out a data journalism piece on bias in how complexion products are named. The data shows that "nude" and "natural" branding across brands disproportionately reflects lighter shades of skin.
[ Present ] The usual ethics of surveillance and bias
Vice reports on how New Orleans musicians and dancers beat facial recognition through a series of grassroots campaigns that ended in legislation in Dec 2020. The fight against a surveillance state will have to happen at all levels of government.
Finally, on December 17, 2020, the ordinance was passed, and banned four pieces of technology: cell site simulators (often called "Stingrays"), predictive policing, characteristic recognition and tracking software, and facial recognition.
The Brookings Institution argues for international treaties around AI use. Much of the article focuses on regulating AI-powered conventional and cyber weapons (missiles, nukes, hacking power grids, etc.), somewhat like how we think about chemical and biological weapons.
An MIT study shows that labeling in popular datasets is not perfect, finding systemic errors in the ground truth data of datasets such as ImageNet, with error rates of up to 5-10% depending on the dataset.
Models benchmarked on the inaccurate corpora ranked differently when evaluated on a clean corpus, as simpler models performed more competitively on accurately labeled datasets.
The authors published a site called "labelerrors.com" to show examples of incorrect tagging in these datasets. Check it out!
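To make the ranking point concrete, here is a minimal toy sketch in Python. Everything in it is hypothetical (the ten labels, the single mislabeled item, and both models' predictions); it is not data or code from the study. It only shows how a model that reproduces a labeling mistake can look better on the noisy test set and worse on the corrected one.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match the given labels."""
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


# Hypothetical 10-item test set; none of this is data from the study.
clean_labels = ["cat", "dog"] * 5
noisy_labels = list(clean_labels)
noisy_labels[0] = "dog"  # one mislabeled item, i.e. a 10% label error rate

# Hypothetical predictions: the larger model reproduces the labeling mistake,
# while the simpler model predicts the true class everywhere.
large_model = list(noisy_labels)
simple_model = list(clean_labels)

for name, preds in [("large", large_model), ("simple", simple_model)]:
    print(
        name,
        "noisy-label accuracy:", accuracy(preds, noisy_labels),
        "clean-label accuracy:", accuracy(preds, clean_labels),
    )
```

In this toy setup the large model ranks first against the noisy labels and second against the corrected ones, which is the kind of flip the study describes.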

Small Bytes
Booking.com was fined 475K euros by the Dutch Data Protection Authority for failing to report a data breach within 72 hours.
ARM releases ARMv9, a significant release 10 years after the previous ARMv8. As an observer, what I can understand is...
“…expect a massively altered security landscape, along with improvements to vector math (which in turn means improvements in AI/ML and Digital Signal Processing, among other applications).” (Ars Technica)
Boston Dynamics debuts Stretch, a warehouse robot angling to become a staple of moving boxes around warehouses.
A digital artwork NFT drawn by AI robot Sophia pulls in $688,888. The NFT craze is significantly overblown, but some of it is driven by digitally native works such as AI-generated paintings.
Here’s a survey that demonstrates just how realistic GAN-generated faces are. TL;DR: people can't tell the difference.
Tableau CEO Adam Selipsky leaves to become CEO of AWS. Adam drove Tableau's cloud play and its acquisition by Salesforce during his tenure.

Corporate Blogs
[Pinterest x Flink] Image Similarity in Real Time. The Pinterest team talks about their near-real-time pipeline for duplicate image detection and full reverse image search across their platform.
[Etsy x Ads] Etsy Ads talks Contextual Bidding. The team talks about how they implemented contextual bidding in the Etsy Ads platform by predicting the post-click conversion rate to optimize the bid price.
[Pinterest x Open Source] Pinterest open-sources QueryBook, a tool used to run SQL queries in a notebook.
[Swiggy x ML Embeddings] Embeddings for Food Search using Siamese Networks. The team discusses their modeling approach to building embeddings for a complex network of food items.
[Shopify x CDC] Shopify talks about making their move to Change Data Capture (CDC) w/ Debezium, migrating from a batch extraction system. A nice post where the little details can provide insight into these companies' infrastructures.
[NYTimes x Data Interviews] The team revamps their Data Science SQL interviews based on learnings from their previous assessments, as they evaluated live coding, take-home assessments, and white-boarding exercises.
Industry and Fundraising
Cere Network - $5M for decentralized data cloud
Feedback Loop - $14M for market research
1910 Genetics - $22M for AI for drug discovery
Zoomin - $52M for extracting data from knowledge base content
Opswat - $125M for protecting infrastructure from malware and zero-days
UiPath files S-1 as an RPA company
This Week in Data is a weekly newsletter to help you stay up to date with developments in the data ecosystem. My goal is to bring focus on broader data trends to data professionals and enthusiasts who are interested in data and its applications. Topics include infrastructure, AI/ML, experimentation, analytics/BI, privacy, security.