Predicting Gender with Logistic Regression

In this analysis, I explored a dataset of fatal police shootings in the United States to predict the gender of the individuals involved from various features. I used a logistic regression model and summarize the insights gained from the dataset below.

I focused on the categorical variables threat_type, flee_status, armed_with, and race as features, with gender as the prediction target. I divided the data into training and testing sets, with 80% of the data used for training and 20% for testing.
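The setup described above can be sketched roughly as follows. This is a minimal sketch on made-up rows, assuming pandas and scikit-learn. The one-hot encoding, the stratified split, and the random seed are assumptions for illustration, since the exact preprocessing is not shown in this post.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Made-up illustrative rows; the real analysis used the fatal police shootings data.
df = pd.DataFrame({
    "threat_type": ["shoot", "point", "attack", "threat", "move"] * 8,
    "flee_status": ["not", "car", "foot", "not"] * 10,
    "armed_with": ["gun", "knife", "unarmed", "vehicle", "gun"] * 8,
    "race": ["W", "B", "H", "A"] * 10,
    "gender": ["male"] * 30 + ["female"] * 10,
})

features = ["threat_type", "flee_status", "armed_with", "race"]
X = pd.get_dummies(df[features])          # one-hot encode the categorical features
y = (df["gender"] == "male").astype(int)  # 1 = male, 0 = female

# 80% training, 20% testing. stratify=y (an assumption, not confirmed by the post)
# keeps the male/female ratio similar in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
print(f"Test accuracy: {accuracy:.4f}")
```

On data this imbalanced, accuracy alone can be misleading, so it is worth checking per-class results as well.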

The model achieved an accuracy of approximately 95.19%, which looks very good at first glance. I also created a bar chart to visualize the coefficients of the logistic regression model used in this analysis.

This visualization shows how each feature influences the prediction of gender in the context of fatal police shootings. I also generated a confusion matrix to describe the performance of the classifier. It summarizes the number of correct and incorrect predictions, broken down by class.

My observations from the matrix are as follows:

  • Top-left cell (True Negative): The number of actual females correctly predicted as females. In this case, the count is 0, indicating that the model failed to correctly predict any of the female samples.
  • Top-right cell (False Positive): The number of actual females incorrectly predicted as males. The count is 61, showing that all the females in the test set were misclassified as males.
  • Bottom-left cell (False Negative): The number of actual males incorrectly predicted as females. The count is 0, indicating that there were no males misclassified as females.
  • Bottom-right cell (True Positive): The number of actual males correctly predicted as males. The count is 1207, showing that the model was very effective at identifying the male samples.
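  • Overall, the model is heavily biased toward the male class: it predicted male for every sample in the test set, so the 95.19% accuracy simply matches the share of males in the test set (1207 of 1268).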
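The four counts above are enough to work out per-class metrics by hand. The short sketch below uses only those reported counts; balanced accuracy and the majority-class baseline are extra metrics added here for comparison and were not part of the original analysis.

```python
# Counts reported in the confusion matrix above (female = negative, male = positive).
tn, fp, fn, tp = 0, 61, 0, 1207

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
male_recall = tp / (tp + fn)      # share of actual males predicted as male
female_recall = tn / (tn + fp)    # share of actual females predicted as female
balanced_accuracy = (male_recall + female_recall) / 2
# Accuracy of a trivial model that always predicts the larger class.
majority_baseline = max(tp + fn, tn + fp) / total

print(f"accuracy          = {accuracy:.4f}")
print(f"male recall       = {male_recall:.4f}")
print(f"female recall     = {female_recall:.4f}")
print(f"balanced accuracy = {balanced_accuracy:.4f}")
print(f"majority baseline = {majority_baseline:.4f}")
```

The accuracy equals the majority baseline, which means the model did no better than always predicting male.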

Next, I plan to raise my questions and concerns with the professor/TA in the next class.

 

A Deep Dive into California, Texas, and Florida

Hi,

In this analysis, I focused on three major U.S. states: California (CA), Texas (TX), and Florida (FL). I applied K-Means clustering to the incident locations, which grouped the incidents into four distinct spatial clusters for each state. This approach shows which areas have higher concentrations of these unfortunate events.
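To make the clustering step concrete, here is a minimal NumPy sketch of the basic K-Means procedure (Lloyd's algorithm) on synthetic latitude/longitude points. The hub locations, point counts, and function names are invented for illustration, and the original analysis most likely used a library implementation rather than this hand-written version. Treating latitude and longitude as flat plane coordinates is also only an approximation, though it is usually acceptable within a single state.

```python
import numpy as np

def nearest_center(points, centers):
    """Return, for each point, the index of the closest center (Euclidean distance)."""
    dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    return dists.argmin(axis=1)

def kmeans(points, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm: assign points to the nearest center, then update centers."""
    rng = np.random.default_rng(seed)
    # Start from k distinct points chosen at random.
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        labels = nearest_center(points, centers)
        # Move each center to the mean of its points; keep it in place if it has none.
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return nearest_center(points, centers), centers

# Synthetic (latitude, longitude) points scattered around four invented hubs.
rng = np.random.default_rng(1)
hubs = np.array([[34.0, -118.2], [37.8, -122.4], [32.7, -117.2], [38.6, -121.5]])
coords = np.vstack([hub + rng.normal(scale=0.2, size=(50, 2)) for hub in hubs])

labels, centers = kmeans(coords, k=4)
cluster_sizes = np.bincount(labels, minlength=4)
print("Incidents per cluster:", cluster_sizes)
```

Because the starting centers are random, K-Means can settle on different groupings for different seeds, so cluster numbers such as "Cluster 0" carry no meaning beyond a single run.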

I observed the following:

  • Each state has its own unique pattern. While some clusters are densely packed in urban areas, others spread out in more rural regions.
  • The scatter plots reveal that incidents are not evenly distributed but rather concentrate in certain areas.

Breaking down the clusters by state:

  1. California (CA)
    Cluster 0: This is the largest cluster, with 362 incidents. It mainly captures the dense urban areas of the state.
    Cluster 1: With 145 incidents, this cluster represents a mix of urban and suburban areas.
    Cluster 2: This is the smallest cluster, with 89 incidents, indicating less frequent occurrences in these regions.
    Cluster 3: Comprising 239 incidents, this cluster spans several urban zones.

  2. Texas (TX)
    Cluster 0: The largest cluster, with 266 incidents, capturing major cities and their surroundings.
    Cluster 1: This cluster represents 115 incidents, predominantly in the eastern part of the state.
    Cluster 2: With 179 incidents, this cluster spreads across the central regions.
    Cluster 3: This is the smallest cluster in Texas, with 67 incidents.

  3. Florida (FL)
    Cluster 0: Representing 104 incidents, this cluster is situated around the northern part of the state.
    Cluster 1: The largest in Florida, this cluster has 249 incidents and covers the southern tip, including areas around Miami.
    Cluster 2: This is the smallest cluster, with only 17 incidents.
    Cluster 3: With 89 incidents, it captures the central regions of the state.

I intend to discuss my concerns and questions with the professor in the next class.

EDA on Armed Status

Hi, I analyzed how the incidents are distributed across the different armed statuses.

I observed that the majority of fatal police shootings involve individuals armed with guns, with the number significantly higher than any other category. Knives are the second most common weapon.

Then, to gain a better understanding, I plotted armed status against threat type for the most common armed statuses.

I found that people with guns mainly “shoot” or “point” them. Those with knives often “threaten” or “attack” without actively using the knife like a gun. Vehicles are mostly used to “attack”, implying they’re used as weapons. Unarmed individuals often show “attack” or “move” behaviors, suggesting their actions, not weapons, are seen as threats.

Then I analyzed the relationship between armed status and flee status.

My observations were as follows. Most people in the incidents didn’t try to flee. People with guns often fled by car or on foot. Those with knives mainly stayed, but if they fled, it was on foot. Many unarmed individuals tried to escape, either on foot or by car. Those using vehicles as weapons often used them to flee as well.

Next, I am planning to look for correlations between the columns and to develop a strategy for filling in the missing values. I’m also planning to share my observations and issues with the professor during our upcoming class.

EDA on Age Distributions and Proportions

In today’s analysis, I looked into other variables, such as gender, race, and whether the individual was armed, was fleeing, or whether the incident was related to mental illness.

While analyzing the pie chart of the age groups of individuals killed by the police, I observed that the 21-30 and 31-40 age groups have the largest proportions.

Then I explored the gender distribution by age group.

I observed that the number of males killed by the police is significantly higher than females across all age groups. The highest count of males is observed in the age group 31-40, followed closely by the 21-30 age group.

When analyzing it in percentages, I noticed that in nearly all age groups the male percentage is overwhelmingly higher than the female percentage. The 0-10 age group is the only one where the gender distribution is balanced, with males and females each at 50%. The 61-70 age group has the highest male percentage (97.23%) and, correspondingly, the lowest female percentage. The female percentage seems slightly higher in the older age groups (71-80 and 81+) compared to the middle age groups, but it’s still significantly lower than the male percentage.

This analysis confirms that a significant majority of individuals killed by the police are male in nearly every age group.
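A breakdown like this can be produced with a cross-tabulation. The sketch below assumes pandas and uses a handful of made-up rows; the column names `age` and `gender` and the bin edges are assumptions chosen to match the age groups named above.

```python
import pandas as pd

# Made-up rows for illustration only, not the real dataset.
df = pd.DataFrame({
    "age": [25, 27, 35, 38, 33, 36, 65],
    "gender": ["male", "female", "male", "male", "male", "female", "male"],
})

# Ten-year age groups; the last bin is open-ended for ages above 80.
bins = [0, 10, 20, 30, 40, 50, 60, 70, 80, float("inf")]
labels = ["0-10", "11-20", "21-30", "31-40", "41-50", "51-60", "61-70", "71-80", "81+"]
df["age_group"] = pd.cut(df["age"], bins=bins, labels=labels, include_lowest=True)

# Raw counts, then row-wise percentages (each age group sums to 100%).
counts = pd.crosstab(df["age_group"], df["gender"])
percent = pd.crosstab(df["age_group"], df["gender"], normalize="index") * 100
print(counts)
print(percent.round(2))
```

Looking at the counts next to the percentages matters here: a 50/50 split in a group with only a few records says much less than a 97% share in a group with hundreds.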

Next, I plan to analyze race, armed status, flee status, and mental illness relation by age group.

Geopy Data Visualization

After a comprehensive exploratory data analysis, I decided to create an interactive map using the location data given in the dataset, i.e., latitude and longitude.

After analyzing the map closely, I noted the following points:

  1. High Concentration Areas: The map shows a higher concentration of police shootings in certain urban areas. This could be due to higher population densities or higher crime rates in these regions.
  2. Rural areas and certain states seem to have fewer incidents of police shootings. Factors could include lower population densities or fewer incidents requiring police intervention.
  3. Some states seem to have a higher number of shootings relative to their size and population. A deeper analysis comparing the number of shootings with each state’s population could show which states have high or low incident rates.
  4. Major cities seem to have a higher number of shooting incidents. This correlation might be due to a combination of factors, including higher population density, increased police presence, or socio-economic factors.

Further Analysis:

As advised by the professor, I will download data from the census website and merge the two sheets at the county level, which I feel would be more appropriate for an in-depth analysis.
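A county-level merge of this kind could look roughly like the sketch below, assuming pandas. All column names and numbers are hypothetical placeholders, since the actual layout of the census sheet is not known yet.

```python
import pandas as pd

# Hypothetical column names and made-up values, for illustration only.
shootings = pd.DataFrame({
    "state": ["CA", "CA", "TX", "FL"],
    "county": ["Los Angeles", "Los Angeles", "Harris", None],
})
census = pd.DataFrame({
    "state": ["CA", "TX", "FL"],
    "county": ["Los Angeles", "Harris", "Miami-Dade"],
    "population": [1_000_000, 500_000, 250_000],  # placeholders, not real census figures
})

# Count incidents per county; rows without a county cannot be matched, so drop them.
per_county = (
    shootings.dropna(subset=["county"])
    .groupby(["state", "county"])
    .size()
    .reset_index(name="incidents")
)

# Left join keeps every county that has incidents; validate guards against
# duplicate county rows in the census sheet.
merged = per_county.merge(
    census, on=["state", "county"], how="left", validate="many_to_one"
)
merged["incidents_per_100k"] = merged["incidents"] / merged["population"] * 100_000
print(merged)
```

Joining on both state and county avoids mixing up counties that share a name across states. County names may also need cleaning (for example a trailing "County") before the keys line up.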

EDA on Dataset

Today I did an in-depth EDA on the Washington Post fatal police shootings dataset, which contains 8,770 records and 19 columns.

The dataset has a lot of missing data in several variables, particularly “County” (55.38%) and demographic variables like “Race” (16.18%). Other variables, such as “Flee Status” and the location fields “Longitude” and “Latitude”, also have missing entries, which will affect geographical analyses.
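Missing-value percentages like these can be computed per column in one line with pandas. The sketch below uses a tiny made-up frame with assumed column names, so its numbers do not match the real dataset.

```python
import numpy as np
import pandas as pd

# Made-up rows with assumed column names, for illustration only.
df = pd.DataFrame({
    "county": ["King", None, None, "Cook"],
    "race": ["W", "B", None, "H"],
    "latitude": [47.6, np.nan, 29.7, 41.9],
})

# isna() marks missing cells; the column mean of those flags is the missing share.
missing_pct = (df.isna().mean() * 100).sort_values(ascending=False)
print(missing_pct.round(2))
```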

After that, I moved on to demographic analysis. Most of the individuals involved were male and primarily between the ages of 20 and 40, with a slight skew towards younger ages. Regarding race, White individuals were most frequently involved, followed by Black and Hispanic individuals.

While analyzing the geography, I plotted a bar chart of the number of incidents in each state and observed that the incidents are not uniformly distributed across states.

California (1235 incidents), Texas (807 incidents), and Florida (559 incidents) have notably more incidents than other states.

I am planning to conduct further detailed analysis and identify any correlations between these variables. Also, I am planning to ask my questions to the professor in the upcoming class.

October 11, 2023

Today we started Project 2, which uses the Washington Post data repository on fatal police shootings in the United States. The data has records starting from January 2, 2015, and is updated weekly. While doing some basic analysis, I discovered that there are many missing values in some parameters, such as flee, age, and race. I am still unsure how to address these missing values and proceed with the project. In addition, which model to use depends on our end goal and on our findings.

In the coming days, I will do a comprehensive analysis to find relationships between the parameters and get a better picture.

October 2, 2023

In the current analysis, I attempted to construct a model based on equation i., aiming to predict inactivity and obesity using various parameters such as physical environment, transport, economics, and food access. However, I am encountering an issue where the R² values are lower than the R² value of the equation:

Moreover, there are ongoing issues with the code that I am still debugging. Separately, since this is an experimental project, I plan to use Weighted Least Squares (WLS) to see whether I can improve the model’s accuracy and complete the project.
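For reference, WLS minimizes the weighted sum of squared residuals, which is the same as running ordinary least squares after scaling each row by the square root of its weight. The NumPy sketch below shows this on synthetic data; the function name, the data, and the choice of weights (inverse noise variance) are assumptions for illustration and are not taken from the project. A library such as statsmodels also provides a ready-made WLS class.

```python
import numpy as np

def weighted_least_squares(X, y, weights):
    """Solve min_b sum_i w_i * (y_i - x_i @ b)**2 by row scaling plus ordinary least squares."""
    sqrt_w = np.sqrt(weights)
    Xw = X * sqrt_w[:, None]   # scale each row of X by sqrt(w_i)
    yw = y * sqrt_w            # scale each target the same way
    beta, *_ = np.linalg.lstsq(Xw, yw, rcond=None)
    return beta

# Synthetic data where the noise grows with x (heteroscedastic).
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
X = np.column_stack([np.ones_like(x), x])   # intercept column + one predictor
noise_sd = 0.5 + 0.3 * x
y = 2.0 + 1.5 * x + rng.normal(scale=noise_sd)

# Assumed weights: inverse of the noise variance, so noisier points count less.
weights = 1.0 / noise_sd**2
beta = weighted_least_squares(X, y, weights)
print("Estimated intercept and slope:", beta)
```

In a real project the noise variance is not known in advance, so the weights have to be chosen or estimated, for example from group sizes or from the residuals of a first unweighted fit.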