Data Wrangling

Metric Definition

Gini Coefficient:

\[\frac { \sum _{ i=1 }^{ n }{ \sum _{ j=1 }^{ n }{ { t }_{ i }{ t }_{ j }\left| { p }_{ i }-{ p }_{ j } \right| } } }{ 2{ T }^{ 2 }P(1-P) } \]

Where \(n\) is the number of regions, \({t}_{i}\) is the total population of region \(i\), \({p}_{i}\) is the minority population proportion in region \(i\), \(T\) is the total population across all regions, and \(P\) is the minority population proportion across all regions.

The Gini Coefficient defines segregation by the evenness of a population. It essentially describes the average difference in minority population proportions across all pairs of regions in a city, expressed over the maximum possible difference to give a proportion from 0 to 1, with higher values indicating more segregation (the average difference is closer to the maximum difference). Together, \({p}_{i}\) and \({t}_{i}\) tell us the size of the minority population in a region, which we can then compare across regions and normalize against the citywide figures (\(P\) and \(T\)). This metric is well suited to gauging differences between regions, as it explicitly compares every pair of regions. It is also a comparison of spatial distributions, something easy to visualize and understand. However, it is naive to think that the physical location of a minority population is the only thing that contributes to segregation. This measure also leaves out possibly important factors such as region location and detailed dynamics within a region, such as the size of the two groups being compared.
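The formula above can be sketched in a few lines of Python. This is a minimal illustration rather than the code used for the results in this post; the function name `gini_coefficient` and the list-based inputs are assumptions, and \(P\) is recovered here as the population-weighted mean of the \({p}_{i}\).

```python
def gini_coefficient(totals, proportions):
    """Gini segregation index from region totals t_i and minority proportions p_i."""
    T = sum(totals)
    if T <= 0:
        raise ValueError("total population must be positive")
    # Citywide minority proportion P, taken as the population-weighted mean of p_i.
    P = sum(t * p for t, p in zip(totals, proportions)) / T
    if not 0 < P < 1:
        raise ValueError("P must lie strictly between 0 and 1")
    # Double sum over all ordered pairs of regions (i, j).
    numerator = sum(
        t_i * t_j * abs(p_i - p_j)
        for t_i, p_i in zip(totals, proportions)
        for t_j, p_j in zip(totals, proportions)
    )
    return numerator / (2 * T**2 * P * (1 - P))


# Two hypothetical cities with two equally sized regions each.
print(gini_coefficient([100, 100], [0.3, 0.3]))  # identical regions
print(gini_coefficient([100, 100], [1.0, 0.0]))  # completely separated regions
```

The two toy cities sit at the ends of the scale: identical regions give 0 and completely separated regions give 1. The double loop is quadratic in the number of regions, which is fine for a sketch but may be slow for a city with many small regions.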

Correlation Ratio:

\[\frac { (I-P) }{ (1-P) } ;\quad I=\sum _{ i=1 }^{ n }{ \left[ \left( \frac { { x }_{ i } }{ X } \right) \left( \frac { { y }_{ i } }{ { t }_{ i } } \right) \right] } \]

Where \(n\) is the number of regions, \(I\) is the isolation index, \(P\) is the minority population proportion across all regions, \({x}_{i}\) is the minority population of area \(i\), \({y}_{i}\) is the majority population of area \(i\), \(X\) is the total minority population across all regions, and \({t}_{i}\) is the total population of region \(i\).

The Correlation Ratio is a method of measuring the potential contact between minority and majority group members, indicating the extent to which two groups share common residential areas. This measure is an adjusted version of the Isolation Index, which measures the probability a minority person shares an area with another minority person, correcting for the possibility of more than one minority group. It produces a value from 0 to 1, with higher values indicating more segregation. The isolation index is determined by looking at the proportion of minority members \({{\left( \frac{{x}_{i}}{X} \right)}}\) and proportion of majority group members \(\left( \frac{{y}_{i}}{{t}_{i}} \right)\) in a region. The correlation ratio then takes the isolation index and puts it in the context of the total minority proportion in a city \(P\). This is a good metric to use if you want more insight on how living in a segregated area can affect a person’s life, outside of where they live. However, this metric doesn’t relate one region to another at all, which prevents us from seeing changes across a city.
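As a rough sketch, the formula can be transcribed directly into Python. The function name `correlation_ratio` and the list-based inputs are assumptions made for illustration. The summand is taken exactly as printed above, with the majority share \({y}_{i}/{t}_{i}\); the isolation index is often written in the segregation literature with the minority share \({x}_{i}/{t}_{i}\) instead, and switching to that form would change only the marked line.

```python
def correlation_ratio(minority, majority, totals):
    """Transcription of (I - P) / (1 - P) using the formula as printed in this post."""
    X = sum(minority)
    T = sum(totals)
    if X <= 0 or X >= T:
        raise ValueError("both groups must be present citywide")
    P = X / T  # citywide minority proportion
    # I as printed: share of the city's minority in region i times the
    # majority share of region i. Empty regions are skipped.
    I = sum(
        (x / X) * (y / t)  # <- the line to change for the x_i / t_i form
        for x, y, t in zip(minority, majority, totals)
        if t > 0
    )
    return (I - P) / (1 - P)


# Hypothetical two-region city: 30 and 10 minority residents out of 100 each.
print(correlation_ratio([30, 10], [70, 90], [100, 100]))
```

Because each term depends only on one region at a time, a single pass over the regions is enough, which mirrors the point that this metric never relates one region to another.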

Delta Index:

\[0.5\sum _{ i=1 }^{ n }{ \left| \left( \frac { { x }_{ i } }{ X } \right) - \left( \frac { { a }_{ i } }{ A } \right) \right| } \]

Where \(n\) is the number of regions, \({x}_{i}\) is the minority population of area \(i\), \(X\) is the total minority population across all regions, \({a}_{i}\) is the area of region \(i\) in square meters, and \(A\) is the total area across all regions in square meters.

The Delta Index measures the concentration of a minority group. This metric gives us the proportion of minority members living in areas with above average proportions of minority people. It can be looked at as the proportion of a group that would have to move to different regions to get a more uniform density. The metric finds this by looking at the absolute difference between the fraction of total minorities and the fraction of total area for a given region, \(\left| \left( \frac {{x}_{i}}{X} \right) -\left( \frac {{a}_{i}}{A} \right) \right|\). One of the features of the Delta Index is that it uses area data to better understand the physical regions where people live. Unfortunately, it uses only one other source of data in its measurements, which could leave out important information. Also, this metric does not compare between regions, only looking at the total. This makes it hard to look at trends between regions.
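A minimal Python sketch of the Delta Index follows. The function name `delta_index` and the list-based inputs are assumptions for illustration; any area unit works as long as it is the same for every region, since only the ratio \({a}_{i}/A\) enters the formula.

```python
def delta_index(minority, areas):
    """Delta Index: half the summed gap between minority share and area share."""
    X = sum(minority)  # total minority population
    A = sum(areas)     # total area
    if X <= 0 or A <= 0:
        raise ValueError("total minority population and total area must be positive")
    return 0.5 * sum(abs(x / X - a / A) for x, a in zip(minority, areas))


# Hypothetical city: all minority residents live in one quarter of the area.
print(delta_index([100, 0], [1, 3]))
# Hypothetical city: minority residents are spread in proportion to area.
print(delta_index([25, 75], [1, 3]))
```

In the first toy city, three quarters of the minority population would have to move to even out the density, so the index is 0.75; in the second the shares match exactly and the index is 0.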

Metric Comparison

After computing these metrics, we can directly compare the segregation of various cities:

City	Gini	Correlation	Delta
Baltimore	0.77	0.44	0.40
Charleston	0.57	0.23	0.55
Chicago	0.70	0.38	0.52
Columbus	0.58	0.25	0.56
Dayton	0.68	0.37	0.64
Denver	0.48	0.13	0.81
Kansas City	0.60	0.27	0.85
Memphis	0.74	0.39	0.79
Milwaukee	0.76	0.45	0.78
Oklahoma City	0.44	0.14	0.76
Pittsburgh	0.68	0.29	0.74
St. Louis	0.76	0.45	0.83
Syracuse	0.68	0.30	0.83
Wichita	0.55	0.19	0.85

All of the metrics used are defined on a normalized scale, with higher values indicating higher segregation. It is important to note, however, that even though all of these metrics have the same range of values, the scales are not necessarily equivalent. A Gini Coefficient of 0.5 is not the same as a Delta Index of 0.5, for example.

According to the Gini Coefficient, the most segregated city is Baltimore (0.77); the Correlation Ratio says it’s Milwaukee (0.45, level with St. Louis at two decimal places); and the Delta Index shows Kansas City (0.85, level with Wichita at two decimal places) as the most segregated. To better understand the variation in segregation metrics, we visualize the data:

Here we see that while the Gini Coefficient and Correlation Ratio appear to be clearly correlated, the Delta Index shows little relation to the other two metrics. We can show that this is the case by testing the correlation of each metric:
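One way to run such a check is sketched below. It assumes the Pearson correlation coefficient (the text does not say which correlation was used) and works from the two-decimal values in the table above, so the results may differ slightly from figures computed on unrounded data. The names `METRICS` and `pearson` are illustrative.

```python
from math import sqrt

# (Gini, Correlation, Delta) per city, copied from the table above.
METRICS = {
    "Baltimore": (0.77, 0.44, 0.40), "Charleston": (0.57, 0.23, 0.55),
    "Chicago": (0.70, 0.38, 0.52), "Columbus": (0.58, 0.25, 0.56),
    "Dayton": (0.68, 0.37, 0.64), "Denver": (0.48, 0.13, 0.81),
    "Kansas City": (0.60, 0.27, 0.85), "Memphis": (0.74, 0.39, 0.79),
    "Milwaukee": (0.76, 0.45, 0.78), "Oklahoma City": (0.44, 0.14, 0.76),
    "Pittsburgh": (0.68, 0.29, 0.74), "St. Louis": (0.76, 0.45, 0.83),
    "Syracuse": (0.68, 0.30, 0.83), "Wichita": (0.55, 0.19, 0.85),
}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Split the table into one list per metric.
gini, corr, delta = (list(column) for column in zip(*METRICS.values()))

print(f"Gini vs Correlation:  {pearson(gini, corr):+.2f}")
print(f"Gini vs Delta:        {pearson(gini, delta):+.2f}")
print(f"Correlation vs Delta: {pearson(corr, delta):+.2f}")
```

With only fourteen cities, any coefficient computed this way should be read as a rough indication rather than a firm estimate.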

These correlations can plausibly be attributed to the fact that the Delta Index is the only index to make use of area data. Since the Gini Coefficient and Correlation Ratio rely on many of the same variables, it makes sense that they are correlated. The addition of area data in the Delta Index means it can vary differently, as it draws on different data.

This is evident in the change in segregation ranking for each metric. Gini and Correlation have almost the same ranking, but the Delta Index is wildly different.
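The rankings can be reproduced from the table with a short sketch like the one below. The names `COLUMNS`, `METRICS`, and `ranking` are illustrative, and the values are the rounded table entries, so cities that are level at two decimal places are simply kept in table order.

```python
COLUMNS = ("Gini", "Correlation", "Delta")

# (Gini, Correlation, Delta) per city, copied from the table above.
METRICS = {
    "Baltimore": (0.77, 0.44, 0.40), "Charleston": (0.57, 0.23, 0.55),
    "Chicago": (0.70, 0.38, 0.52), "Columbus": (0.58, 0.25, 0.56),
    "Dayton": (0.68, 0.37, 0.64), "Denver": (0.48, 0.13, 0.81),
    "Kansas City": (0.60, 0.27, 0.85), "Memphis": (0.74, 0.39, 0.79),
    "Milwaukee": (0.76, 0.45, 0.78), "Oklahoma City": (0.44, 0.14, 0.76),
    "Pittsburgh": (0.68, 0.29, 0.74), "St. Louis": (0.76, 0.45, 0.83),
    "Syracuse": (0.68, 0.30, 0.83), "Wichita": (0.55, 0.19, 0.85),
}


def ranking(column):
    """Cities ordered from most to least segregated under one metric."""
    idx = COLUMNS.index(column)
    # sorted() is stable, so tied cities keep their table order.
    return sorted(METRICS, key=lambda city: METRICS[city][idx], reverse=True)


# Print each city's position (1 = most segregated) under every metric.
positions = {col: {city: i + 1 for i, city in enumerate(ranking(col))} for col in COLUMNS}
for city in METRICS:
    print(city, *(positions[col][city] for col in COLUMNS), sep="\t")
```

Reading across a row shows how far a city moves between metrics; Baltimore, for instance, is first under Gini and last under Delta.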

Note: there is no reason to start with any particular metric in the above visual, but keeping Correlation and Gini next to each other shows their similarity.

Metric Proposal

\[\sum _{ i=1 }^{n}{ \sum _{j=1}^{n}{ \left[ \left| \frac {{p}_{i}}{{a }_{i}} -\frac {{p}_{j}}{{a}_{j}} \right| \right] }} \]

Where \(n\) is the number of regions, \({p}_{i}\) is the minority population proportion in region \(i\), and \({a}_{i}\) is the area in square meters of region \(i\).

This proposed metric measures the difference in minority population proportion per unit of area \(\frac {p}{a}\) between every pair of regions \(i\) and \(j\). If a region has a larger percentage of minorities in a smaller area than some other region, the difference will be larger, which indicates higher segregation. This measure has the benefit of comparing multiple regions against each other, allowing us to better understand changes across a city, while also taking physical population density into consideration.
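The proposed metric is simple enough to sketch directly. The function name `proposed_metric` and the list-based inputs are assumptions for illustration, and the code follows the formula exactly as written, without any normalization.

```python
def proposed_metric(proportions, areas):
    """Sum of |p_i / a_i - p_j / a_j| over all ordered pairs of regions."""
    if any(a <= 0 for a in areas):
        raise ValueError("every region needs a positive area")
    # Minority proportion per unit of area for each region.
    densities = [p / a for p, a in zip(proportions, areas)]
    return sum(abs(d_i - d_j) for d_i in densities for d_j in densities)


# Hypothetical two-region city: proportions 0.8 and 0.2, areas 2 and 1.
print(proposed_metric([0.8, 0.2], [2.0, 1.0]))
# Identical regions give no difference at all.
print(proposed_metric([0.5, 0.5], [1.0, 1.0]))
```

As written, the double sum counts every pair of regions twice and is not bounded between 0 and 1, so its value depends on the number of regions and on the unit used for area. That is worth keeping in mind when comparing it with the three normalized metrics above.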

According to this new metric, the most segregated city is Chicago.

Here again we see little relation between the new metric and the Gini Coefficient and Correlation Ratio, likely because of the inclusion of area data. More interesting is how the rankings change between the Delta Index and new metric. Between these two, there is only a correlation of -0.35.