Issue 29: This Week in Data (March 2, 2021)
(1) Facial recognition is front and center with BIPA fines for Facebook and TikTok. (2) LinkedIn upgrades Gobblin with config-driven integrations. (3) Deepfake Tom Cruise is a treat and trick in one.
Hi TWID,
Can you believe it is February already? I mean March. As the J&J vaccine is approved and 78.6M overall jabs have been given in the US, we see hope on the horizon. As we sit anxiously in front of our computers, the least we can do is read about this week in data.
Data Privacy and Security
The Privacy Corner
BIPA is on fire. Illinois' Biometric Information Privacy Act (BIPA) has become increasingly relevant as facial recognition becomes a widespread, easy-to-implement software technology. Illinois was one of the first states to pass a biometrics privacy law (in 2008), and BIPA has been the basis for many US privacy lawsuits.
Done. A judge has ordered Facebook to pay $650M for violating BIPA by using facial recognition without user consent back in 2015 in Instagram.
Nearly Done. TikTok has agreed to pay $95M over claims that it violated BIPA by sending facial features to China and third parties without user consent. TikTok disputes the charge, but is willing to settle. The settlement awaits the judge's final approval.
Escalated. Clearview AI plans to escalate its BIPA fight to the Supreme Court based on a procedural argument related to standing. My non-legal interpretation is that Clearview wants the case to escalate from state court to federal court to escape BIPA (which is a state law).
Not BIPA but face related. NYT has a piece on how Massachusetts successfully wrote a broad facial recognition law limiting its use in criminal investigations.
In the realm of Consumer Privacy, we have a few updates on the usual fights. A number of bills have been debated for months if not years, and the grind continues.
Just a law is not enough. Several consumer groups, including EFF, called for the Virginia governor to veto the recently passed Consumer Data Protection Act (CDPA) for insufficiently protecting consumers.
More on the way. FastCompany helped compile a digestible overview of upcoming data privacy laws across the US: Nevada, Vermont, Maine, Virginia, New York, Washington, Utah, and Oklahoma.
Data Breaches also never end. In an effort to talk about companies we know, here are some recent data breaches.
Gab. The right-wing social media app had 70GB of its content downloaded after being exploited via SQL injection. The data was uploaded to DDoSecrets, which has been a recent oasis for hacktivists breaching companies they don't agree with.
Sequoia. The VC firm was breached via a successful phishing attempt. Details are light, but so far no content has shown up on the dark web.
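For readers who haven't seen SQL injection up close, here is a minimal sketch of the general class of bug behind the Gab item above. It is not a reconstruction of the actual exploit: the table, the function names, and the payload are all made up for illustration, and it uses Python's built-in sqlite3 module.

```python
import sqlite3


def build_db():
    # Hypothetical in-memory table, for illustration only.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE posts (id INTEGER PRIMARY KEY, author TEXT, body TEXT)"
    )
    conn.executemany(
        "INSERT INTO posts (author, body) VALUES (?, ?)",
        [("alice", "public post"), ("bob", "private note")],
    )
    return conn


def find_posts_unsafe(conn, author):
    # Vulnerable: user input is concatenated straight into the SQL text,
    # so quotes in the input can change the meaning of the query.
    query = "SELECT body FROM posts WHERE author = '" + author + "'"
    return [row[0] for row in conn.execute(query)]


def find_posts_safe(conn, author):
    # Parameterized: the driver passes the input as data, never as SQL.
    query = "SELECT body FROM posts WHERE author = ?"
    return [row[0] for row in conn.execute(query, (author,))]


conn = build_db()
payload = "alice' OR '1'='1"  # classic injection string

leaked = find_posts_unsafe(conn, payload)  # returns every row in the table
blocked = find_posts_safe(conn, payload)   # no author has that literal name
```

The unsafe version hands back both posts because the payload turns the WHERE clause into a condition that is always true. The parameterized version treats the same string as a (nonexistent) author name and returns nothing.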
Data Infrastructure
The Infra Corner
Infrastructure often gets compared to plumbing: absolutely critical, but it sometimes struggles to get respect and glory. One coping mechanism is creating our own little communities where we talk about infrastructure challenges that are often common across similar organizations. Oftentimes, these open source technologies have great Slack channels or mailing lists where developers can discuss the shared struggle.
Kafka. It's easy to forget that Pinterest has a ton of data. The Pinterest team talks about high-scale challenges with Kafka such as disk maintenance, dynamic rebalancing, data transfer costs, etc. Tons of great real-world tips, though they may not be relevant for most of us who are running at a fraction of the scale.
Druid. Reddit moved their advertising reporting infra from Redis to Druid and saw great gains (availability from 99.5% → 99.9%).
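To put Reddit's availability numbers in perspective, here is a quick back-of-the-envelope sketch. It assumes the figures are percentages measured over a full 365-day year, which the summary above does not actually state, and the function name is just for illustration.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def downtime_hours_per_year(availability_pct):
    # Downtime is whatever fraction of the year the service is NOT available.
    return (1 - availability_pct / 100) * HOURS_PER_YEAR


before = downtime_hours_per_year(99.5)  # roughly 43.8 hours per year
after = downtime_hours_per_year(99.9)   # roughly 8.76 hours per year
improvement = before / after            # roughly 5x less downtime
```

Under those assumptions, a 0.4 point bump is the difference between almost two days of downtime per year and about a working day.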
Gobblin. LinkedIn talks about their creation of the Data Integration Library to parameterize and speed up the process of building custom connectors. On top of Gobblin, of course.
One part of the codebase that keeps growing extremely fast is the extractor library that interacts with various sources (i.e., connectors) because of the variety and the large number of B2B vendors. Previously, custom connectors were built to meet specific requirements. This “bag of connectors” strategy led to the following categories of problems...
To achieve the goal of efficiently supporting existing business cases and expediting go-to-market for new ones, we needed a different approach to address the variety problem. We could not simply refactor common elements into libraries; we needed to also create a configuration-driven framework that allowed for degrees of customization, which would ultimately enable rapid results. (LinkedIn Data Integration Library)
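To make "configuration-driven" a bit more concrete, here is a toy sketch of the idea: one generic extractor whose behavior is picked by a config object, instead of one hand-written connector per vendor. This is not Gobblin's or the Data Integration Library's actual API. Every name below (ConnectorConfig, PARSERS, extract, the vendor configs) is hypothetical.

```python
import csv
import io
import json
from dataclasses import dataclass
from typing import Callable, Dict, Iterator, List

Record = Dict[str, str]


def parse_csv(payload: str) -> Iterator[Record]:
    # Each CSV row becomes a dict keyed by the header line.
    yield from csv.DictReader(io.StringIO(payload))


def parse_json(payload: str) -> Iterator[Record]:
    # Expects a JSON array of flat objects.
    for item in json.loads(payload):
        yield {key: str(value) for key, value in item.items()}


# Shared, reusable parsers; a new vendor picks one by name in its config.
PARSERS: Dict[str, Callable[[str], Iterator[Record]]] = {
    "csv": parse_csv,
    "json": parse_json,
}


@dataclass
class ConnectorConfig:
    name: str
    format: str                # key into PARSERS
    field_map: Dict[str, str]  # source field name -> target field name


def extract(config: ConnectorConfig, payload: str) -> List[Record]:
    # One generic code path: parse by configured format, then rename fields.
    parser = PARSERS[config.format]
    return [
        {target: row[source] for source, target in config.field_map.items()}
        for row in parser(payload)
    ]


# Two hypothetical vendors, onboarded with config only (no new code).
vendor_a = ConnectorConfig("vendor_a", "csv", {"Email": "email", "Co": "company"})
vendor_b = ConnectorConfig("vendor_b", "json", {"mail": "email", "org": "company"})

rows_a = extract(vendor_a, "Email,Co\nann@example.com,Acme\n")
rows_b = extract(vendor_b, '[{"mail": "bo@example.com", "org": "Initech"}]')
```

Both vendors end up with the same output schema, and adding a third means writing another config rather than another connector. A real framework would also need to handle things like authentication, pagination, and scheduling, which this sketch leaves out.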
Spark. Databricks releases Spark 3.1.1, which includes Project Zen to improve PySpark, improvements for ANSI SQL compliance, improved performance on predicate pushdown and shuffling, and streaming improvements.
Small Bytes
The ACM Conference on Fairness, Accountability, and Transparency has suspended its partnership with Google over recent controversies in its ethical AI team.
Nathan Yau shows how minimum wage has changed in each state over time. A simple but fun visualization that gives us a sense of time from 1968 to 2021.
The National Security Commission on Artificial Intelligence releases a massive 756-page mega-report on artificial intelligence. It's chaired by Eric Schmidt of Google fame and full of tech people. The report is huge, comes as a nice little reactive website, and will probably lead to a massive investment in AI capabilities within the government.
Biden is already proposing $37B in legislation to address recent chip shortages and the US semiconductor supply chain.
Fake Tom Cruise on TikTok (deepfaketom) is hilarious but (upon reflection) deeply troubling for what videos we can trust in the future.
McDonald's is thinking of selling its personalized suggestions startup Dynamic Yield after a $300M acquisition back in 2019. Personalization hype has made it to traditional industries, but it has to continue to prove itself in the real world.
Interworks gives a quick viz of every BI Magic Quadrant since 2011. We can see how some companies crawled to the top right, and others fell off the graph. Watch as Tableau slowly crawled to relevance, and as MicroStrategy slowly became a bitcoin bank.
Corpo Updates
Google and MongoDB have joined in a partnership that connects Mongo Atlas w/ Google downstream integrations.
Microsoft releases Azure Percept as an end-to-end edge AI ecosystem, as well as Azure Arc for containerized ML workloads. This news comes from the Microsoft Ignite Conference, held 3/2-3/4.
DataStax releases the Astra serverless computing platform. It is essentially attempting to horizontally scale Cassandra through a microservice-y architecture.
Amazon announces Amazon Lookout for Vision, a product that helps detect defects in manufactured goods.
Industry and Fundraising
QualityMatch - $6M for data annotation
Prophecy.io - $6.75M for drag and drop spark data eng pipelines
TripleBlind - $8.2M for model training on encrypted data
January AI - $8.8M for predicting diabetic patients' responses to food
SoundCommerce - $15M for ecommerce and retail insights
Hubilo - $23.5M for virtual conference events enriched with realtime analytics
Katana Graph - $28.5M for unstructured data
Symbio Robotics - $30M for factory automation with AI
PerimeterX - $57M to fight bots
Beam Dental - $80M for AI powered dental insurance
ScienceLogic - $105M for AI operations
ChartIO acquired by Atlassian for an undisclosed amount, for data visualization capabilities in the app. (ChartIO will be sunset.)
This Week in Data is a weekly newsletter to help you stay up to date with developments in the data ecosystem. My goal is to bring focus on broader data trends to data professionals and enthusiasts who are interested in data and its applications. Topics include infrastructure, AI/ML, experimentation, analytics/BI, privacy, security.