
IBM builds a more diverse million-face dataset to help reduce bias in AI

Encoding biases into machine learning models, and in general into the constructs we refer to as AI, is nearly inescapable — but we can sure do better than we have in past years. IBM is hoping that a new database of a million faces more reflective of those in the real world will help.

Facial recognition is being relied on for everything from unlocking your phone to your front door, and is being used to estimate your mood or likelihood to commit criminal acts, and we may as well admit many of these applications are bunk. But even the good ones often fail simple tests like working adequately with people of certain skin tones or ages.

This is a multi-layered problem, and of course a major part of it is that many developers and creators of these systems fail to think about, let alone audit for, a failure of representation in their data.

That’s something everyone needs to work harder at, but the actual data matters as well. How can you train a computer vision algorithm to work well with all people if there’s no set of data that has all people in it?

Every set will necessarily be limited, but building one that has enough of everyone in it that no one is effectively systematically excluded is a worthwhile goal. And with its new million-image Diversity in Faces (DiF) set, that’s what IBM has attempted to create. As the paper introducing the set reads:

For face recognition to perform as desired – to be both accurate and fair – training data must provide sufficient balance and coverage. The training data sets should be large enough and diverse enough to learn the many ways in which faces inherently differ. The images must reflect the diversity of features in faces we see in the world.

The faces are sourced from a huge 100-million-image dataset (Flickr Creative Commons), through which another machine learning system prowled and found as many faces as it could. These were then isolated and cropped, and that’s when the real work started.

These sets are meant to be ingested by other machine learning algorithms, so they need to be both diverse and accurately labeled. So the DiF set has a million faces, and each one is accompanied by metadata describing things like the distance between the eyes, the size of the forehead, and similar facial measurements. All these measurements together create the “faceprint” that a system would use to, for example, match one image to another of the same person.
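To make the idea of matching by faceprint concrete, here is a minimal sketch. It assumes a faceprint is just a handful of named measurements and that "closest" means smallest Euclidean distance. The field names, the numbers, and the distance choice are all illustrative assumptions, not the DiF schema or IBM's method.

```python
import math

# Hypothetical faceprints: a few facial measurements in arbitrary units.
# The keys below are illustrative only, not the DiF metadata fields.
def faceprint_distance(a, b):
    """Euclidean distance between two faceprints that share the same keys."""
    return math.sqrt(sum((a[k] - b[k]) ** 2 for k in a))

def best_match(query, gallery):
    """Return the name of the gallery faceprint closest to the query."""
    return min(gallery, key=lambda name: faceprint_distance(query, gallery[name]))

gallery = {
    "person_a": {"eye_distance": 62.0, "forehead_height": 58.0, "nose_length": 49.0},
    "person_b": {"eye_distance": 66.5, "forehead_height": 63.0, "nose_length": 53.5},
}
query = {"eye_distance": 62.4, "forehead_height": 57.5, "nose_length": 49.2}

print(best_match(query, gallery))
```

A real system would likely learn which measurements matter and how to weight them, which is exactly where unrepresentative training data can do damage.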

But any given set of those measurements may or may not be good for identifying people, or accurate for a certain ethnic group, or what have you. So the IBM team put together a revised set that includes not only simple things like distances between features, but also how those measures relate to one another, for example the ratio of the area above the eyes to the area below the nose. Skin color, as well as contrast and types of coloration, are also included.
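One practical property of ratios is that they do not change when the whole face is photographed larger or smaller. The sketch below illustrates that with made-up measurements. The names and the particular ratios are hypothetical, and the article does not say this is why the team chose them.

```python
def feature_ratios(m):
    """Derive ratios from raw distances (hypothetical measurement names)."""
    return {
        "upper_to_lower": m["brow_to_hairline"] / m["nose_to_chin"],
        "eye_to_face_width": m["eye_distance"] / m["face_width"],
    }

# Made-up raw distances in pixels for one face.
raw = {
    "brow_to_hairline": 60.0,
    "nose_to_chin": 75.0,
    "eye_distance": 64.0,
    "face_width": 128.0,
}

# The same face in an image twice the size: every distance doubles.
scaled = {name: value * 2 for name, value in raw.items()}

print(feature_ratios(raw))
print(feature_ratios(raw) == feature_ratios(scaled))
```

The raw distances differ between the two images, but the ratios come out the same.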

In a move that is long overdue, gender in the set is detected and encoded according to a spectrum, not a binary. As gender is itself nonbinary, it makes sense to represent it as any fraction between 0 and 1. So what you really have is a metric describing how individuals present on a scale from feminine to masculine.

Age is also automatically estimated, but for these last two values a sort of “reality check” is also included in the form of a “subjective annotation” field in which people were asked to label faces male or female and guess at age. Bias may be re-encoded here, as sourcing from humans tends to introduce it. All these make for a considerably broader set of measurements than any other publicly available facial recognition training set.

You may wonder why race or ethnicity isn’t a category — IBM’s John R. Smith, who led the creation of the set, explained in an email to me:

Ethnicity and race are often used interchangeably, although the first is more related to culture and the second is related to biology. The boundaries within either are not distinct, and labeling is highly subjective and noisy as found in prior work. Instead, we chose to focus on coding schemes that could be determined reliably and have some kind of continuous scale that could feed diversity analysis. We may return to some of these subjective categories.
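As an illustration of what a diversity analysis over a continuous scale could look like, the sketch below bins a value between 0 and 1 and computes Shannon evenness, a standard measure that reaches 1 when every bin is equally populated and falls toward 0 as one bin dominates. The article does not say which measures the team uses, so the choice of measure, the bin count, and the sample data are all assumptions.

```python
import math
from collections import Counter

def shannon_evenness(values, bins=4, lo=0.0, hi=1.0):
    """Bin non-empty values in [lo, hi] and return Shannon evenness in [0, 1]."""
    width = (hi - lo) / bins
    # Clamp so that a value exactly equal to hi lands in the last bin.
    counts = Counter(min(int((v - lo) / width), bins - 1) for v in values)
    total = len(values)
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return entropy / math.log(bins)

# Made-up attribute values: one set spread evenly, one piled into a single bin.
balanced = [0.1, 0.3, 0.6, 0.9] * 25
skewed = [0.1] * 97 + [0.3, 0.6, 0.9]

print(round(shannon_evenness(balanced), 3))
print(round(shannon_evenness(skewed), 3))
```

Both sample sets contain 100 values that touch every bin, but the skewed one scores far lower, which is the kind of gap a coverage audit would flag.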

Even with a million faces, however, there’s no guarantee that this set is adequately representative — that enough of all groups and sub-sets are present to prevent bias. In fact, Smith seems sure it isn’t, which is really the only logical position.

We could not ensure this in this first version of the data set. But, it is the goal. First, we need to figure out the dimensions for diversity. We do that by starting with data and coding schemes as in this release. Then we iterate. Hopefully, we bring along the larger research community and industry in the process.

In other words, it’s a work in progress. But so is all of science, and despite the frequent missteps and broken promises, facial recognition is inarguably a technology we will all be engaging with in the future, whether we like it or not.

Any AI system is only as good as the data on which it’s built, so improvements to the data will trickle down for a long time. Like any other set, DiF will likely go through iterations addressing shortcomings, adding more content, and integrating suggestions or requests from researchers using it. You can request access here.
