Project-3
Dec 8, 2023
Hello,
We have started working on the project report. As of today, we have finalized the issues, discussion, and results sections, and we plan to complete the remaining sections by tomorrow.
Dec 4, 2023
Hello,
In this section, I conducted Time Series Decomposition for two key economic indicators: Hotel Occupancy Rates and Logan Airport Passengers. Here’s a breakdown of the process and the results:
– I merged the “Year” and “Month” columns into a single datetime column and set it as the index of the dataset, making it easier to perform time series analysis.
– I selected two economic indicators for analysis: Hotel Occupancy Rates and Logan Airport Passengers.
– For each indicator, I performed seasonal decomposition using the `seasonal_decompose` function from `statsmodels.tsa.seasonal`. I used an additive model and specified a period of 12 months, indicating that the seasonality repeats annually.
I plotted the decomposition results for both indicators, with each plot displaying four components:
1. Observed Data: This plot shows the actual values of the economic indicator over time, in this case, Hotel Occupancy Rates and Logan Airport Passengers.
2. Trend Component: The trend plot reveals the underlying trend or pattern in the data. It helps identify whether the indicator is generally increasing, decreasing, or following a specific pattern.
3. Seasonal Component: The seasonal plot displays the recurring patterns or seasonality in the data. It helps identify any regular fluctuations that occur at specific times of the year.
4. Residuals: The residuals plot represents the remaining variation in the data after removing the trend and seasonal components. It can provide insights into irregularities or unexpected changes in the data.
Formatting and Visualization:
– To enhance readability, I formatted the x-axis of the plots to display years using the `mdates` module. This makes it easier to identify trends and seasonality over time.
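To make the workflow above concrete, here is a minimal sketch of the decomposition and plotting steps. The file name `economic_indicators.csv` and the exact column names (`Year`, `Month`, `Hotel Occupancy Rate`, `Logan Passengers`) are assumptions and may need to be adjusted to match the actual dataset.

```python
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
from statsmodels.tsa.seasonal import seasonal_decompose

# Assumed file and column names; adjust to the actual dataset
df = pd.read_csv("economic_indicators.csv")
df["Date"] = pd.to_datetime(df["Year"].astype(str) + "-" + df["Month"].astype(str))
df = df.set_index("Date").sort_index()

for col in ["Hotel Occupancy Rate", "Logan Passengers"]:
    # Additive decomposition with an annual (12-month) seasonal period
    result = seasonal_decompose(df[col], model="additive", period=12)
    fig = result.plot()  # observed, trend, seasonal, and residual panels
    fig.suptitle(f"Decomposition of {col}", y=1.02)
    for ax in fig.axes:
        ax.xaxis.set_major_formatter(mdates.DateFormatter("%Y"))
    plt.tight_layout()
    plt.show()
```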
Interpretation:
– Time series decomposition is a valuable technique for understanding the underlying patterns and components within economic indicators. It allows us to separate the data into its constituent parts, making it easier to identify trends, seasonality, and irregularities.
– By examining the decomposition plots, we can gain insights into how Hotel Occupancy Rates and Logan Airport Passengers vary over time. This information can be crucial for making informed decisions and strategic planning in various sectors, including tourism and hospitality.
Overall, time series decomposition is a powerful tool for uncovering meaningful patterns within economic data, enabling better analysis and forecasting.
Dec 1, 2023
Hello,
In this section, I conducted a clustering analysis to identify patterns within the dataset. Here’s a summary of the steps and outcomes:
To prepare the data for clustering, I started by normalizing the dataset. Normalization ensures that all variables have the same scale, which is crucial for meaningful clustering results. The `StandardScaler` from `sklearn.preprocessing` was used to standardize the data. I excluded the “Year” and “Month” columns from the normalization process.
To determine the optimal number of clusters for the K-means algorithm, I employed the Elbow Method. This method involves running K-means clustering for a range of cluster numbers (from 1 to 10) and calculating the Within-cluster Sum of Squares (WCSS) for each. The WCSS measures the variability within clusters. I plotted the results of the Elbow Method, and the point at which the decrease in WCSS starts to level off represents the optimal number of clusters.
Based on the Elbow Method analysis, I performed K-means clustering with two different cluster numbers: 3 and 4. The `KMeans` class from `sklearn.cluster` was utilized for this purpose. The models were initialized using the “k-means++” method for better convergence, and a fixed random state (random_state=42) was set for reproducibility.
After fitting the models, I assigned data points to clusters using the `fit_predict` method. Two sets of data were created, one with 3 clusters and another with 4 clusters.
I added the cluster information back to the original dataset for further analysis. This allowed me to understand which cluster each data point belonged to, providing insights into the distinct groups or patterns within the data.
To provide a glimpse of the results, I displayed the first few rows of each dataset with cluster information. This inspection offers a preliminary view of how data points are distributed among clusters.
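A minimal sketch of this clustering workflow is shown below; the file name and the assumption that all remaining columns are numeric (with missing values already handled) are mine, not taken from the original code.

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Assumed file name; "Year" and "Month" are excluded from scaling, and the
# remaining columns are assumed to be numeric with no missing values.
df = pd.read_csv("economic_indicators.csv")
X = StandardScaler().fit_transform(df.drop(columns=["Year", "Month"]))

# Elbow method: within-cluster sum of squares (inertia) for k = 1..10
wcss = [KMeans(n_clusters=k, init="k-means++", random_state=42, n_init=10).fit(X).inertia_
        for k in range(1, 11)]
plt.plot(range(1, 11), wcss, marker="o")
plt.xlabel("Number of clusters")
plt.ylabel("WCSS")
plt.title("Elbow Method")
plt.show()

# Fit models with 3 and 4 clusters and attach the labels to the original data
for k in (3, 4):
    df[f"Cluster_{k}"] = KMeans(n_clusters=k, init="k-means++",
                                random_state=42, n_init=10).fit_predict(X)
print(df.head())
```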
Clustering analysis helps identify inherent structures or groups within the dataset, enabling a deeper understanding of data patterns and trends. It can be a valuable tool for segmentation and decision-making in various domains.
Nov 29, 2023
Hello,
In this section, I conducted a predictive analysis to forecast median housing prices using a Linear Regression model. Here’s a breakdown of the steps and results:
I defined the independent variables (predictors) and the dependent variable (target) for the regression model. The predictors included “Total Jobs,” “Unemployment Rate,” “Hotel Occupancy Rate,” and “Logan International Flights,” while the target variable was “Median Housing Price.”
The dataset was split into training and testing sets to evaluate the model’s performance. In this case, 70% of the data was allocated for training, and 30% for testing. The random_state parameter was set to 42 for reproducibility.
I created a Linear Regression model using the `LinearRegression` class from `sklearn.linear_model`. This model is used to predict the target variable based on the predictor variables.
The model was fitted with the training data using the `fit` method. This process involved learning the relationships between the predictor variables and the target variable.
To assess the model’s performance, I made predictions on the test set using the trained model. I calculated two key performance metrics:
– Mean Squared Error (MSE): A measure of the average squared difference between actual and predicted values. A lower MSE indicates better model performance.
– R-squared (R²) Score: A measure of how well the model explains the variance in the target variable. R² ranges from 0 to 1, with higher values indicating a better fit.
The calculated performance metrics are as follows:
– MSE: [MSE Value]
– R² Score: [R² Value]
To visually assess the model’s performance, I created two plots:
– “Actual vs Predicted Median Housing Prices”: This scatter plot compares the actual median housing prices (y-axis) with the predicted prices (x-axis) for the test set. The dashed line represents perfect predictions.
– “Residuals of Predicted Median Housing Prices”: This scatter plot shows the residuals (differences between actual and predicted prices, y-axis) against the predicted prices (x-axis). The red dashed line at y=0 represents zero residuals.
These visualizations and performance metrics help evaluate the accuracy of the Linear Regression model in predicting median housing prices based on the selected economic indicators.
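The following sketch shows how this regression and its evaluation could be wired together with scikit-learn; the file name and exact column spellings are assumptions.

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Assumed file and column names
predictors = ["Total Jobs", "Unemployment Rate", "Hotel Occupancy Rate",
              "Logan Intl Flights"]
target = "Median Housing Price"
df = pd.read_csv("economic_indicators.csv")

X_train, X_test, y_train, y_test = train_test_split(
    df[predictors], df[target], test_size=0.3, random_state=42)

model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)
print("MSE:", mean_squared_error(y_test, y_pred))
print("R²:", r2_score(y_test, y_pred))

# Actual vs predicted, with a dashed line marking perfect predictions
plt.scatter(y_pred, y_test)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], "k--")
plt.xlabel("Predicted Median Housing Price")
plt.ylabel("Actual Median Housing Price")
plt.show()

# Residuals against predicted values, with a red dashed line at zero
plt.scatter(y_pred, y_test - y_pred)
plt.axhline(0, color="red", linestyle="--")
plt.xlabel("Predicted Median Housing Price")
plt.ylabel("Residual")
plt.show()
```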
Nov 27, 2023
Hello,
To enhance the analysis, I converted the “Year” and “Month” columns into a single datetime column named “Date.” This conversion simplifies the time-based analysis and visualization of data trends.
I then plotted the trend of the unemployment rate in Boston from January 2013 to December 2019. The line graph provides a visual representation of how the unemployment rate changed over this period. It is evident that the unemployment rate experienced fluctuations during these years.
Additionally, I calculated the correlation between the unemployment rate and median housing prices. Correlation analysis helps us understand the relationship between these two variables. In this case, the correlation value quantifies the degree to which changes in the unemployment rate are associated with changes in median housing prices. This statistical measure provides valuable insights into the potential connections between economic indicators.
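A small sketch of these steps, with the file name and column names assumed:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Assumed file and column names
df = pd.read_csv("economic_indicators.csv")
df["Date"] = pd.to_datetime(df["Year"].astype(str) + "-" + df["Month"].astype(str))

# Unemployment rate trend, January 2013 to December 2019
mask = (df["Date"] >= "2013-01-01") & (df["Date"] <= "2019-12-31")
df[mask].plot(x="Date", y="Unemployment Rate", title="Boston Unemployment Rate")
plt.show()

# Correlation between unemployment rate and median housing price
print(df["Unemployment Rate"].corr(df["Median Housing Price"]))
```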
Nov 24, 2023
Hello,
In this phase of my analysis, I delved into correlation analysis and predictive modeling, aiming to uncover relationships between economic indicators and forecast median housing prices. Let’s break down what I’ve accomplished:
Correlation Analysis: I began by selecting a subset of economic indicators, including “Logan Passengers,” “Hotel Occupancy Rate,” “Unemployment Rate,” “Median Housing Price,” and “Housing Sales Volume.” I then created a correlation matrix to visualize the relationships between these indicators. The correlation heatmap displayed the strength and direction of these relationships, providing insights into how changes in one variable may affect others.
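One way to build such a correlation heatmap is sketched below; the seaborn styling, file name, and column spellings are assumptions.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Assumed file and column names for the selected indicators
cols = ["Logan Passengers", "Hotel Occupancy Rate", "Unemployment Rate",
        "Median Housing Price", "Housing Sales Volume"]
df = pd.read_csv("economic_indicators.csv")

corr = df[cols].corr()
sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
plt.title("Correlation Matrix of Selected Economic Indicators")
plt.tight_layout()
plt.show()
```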
Predictive Modeling:
My analysis also involved predictive modeling, specifically focused on forecasting median housing prices. I identified predictor variables, which included “Unemployment Rate,” “Logan Passengers,” and “Total Jobs,” while the target variable was “Median Housing Price.” The dataset was split into training and testing sets to evaluate the model’s performance.
I applied a Linear Regression model to predict median housing prices based on the selected predictor variables. The model was trained on the training data, and predictions were made on the testing data. I assessed the model’s performance using Mean Squared Error (MSE) and R-squared (R²) as key metrics. These metrics provide insights into how well the model predicts median housing prices based on the chosen economic indicators.
In summary, this phase of the analysis involved correlation analysis to understand the relationships between economic indicators and predictive modeling to forecast median housing prices. These insights can be invaluable for decision-making in economic planning and housing market assessments.
Nov 22, 2023
Hello,
In our analysis, I started by preparing the data for time series exploration. We created a new date column by combining the existing “Year” and “Month” columns and set it as the index for our DataFrame. This step was crucial for organizing the data in a time series format, enabling us to analyze how economic indicators change over time.
After the data preparation, we focused on trend analysis for several key economic indicators. These indicators included “Logan Passengers,” “Logan International Flights,” “Hotel Occupancy Rate,” “Hotel Average Daily Rate,” “Total Jobs,” “Unemployment Rate,” “Median Housing Price,” and “Housing Sales Volume.” Our objective was to visualize the trends in these economic factors over time.
I presented the trend analysis results through a series of line plots. Each plot represented one economic indicator, and the x-axis displayed time, while the y-axis represented the values of the respective indicator. This visualization allowed us to observe how these economic variables evolved over the years.
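A minimal sketch of these trend plots, assuming the file name and column spellings below:

```python
import pandas as pd
import matplotlib.pyplot as plt

indicators = ["Logan Passengers", "Logan Intl Flights", "Hotel Occupancy Rate",
              "Hotel Average Daily Rate", "Total Jobs", "Unemployment Rate",
              "Median Housing Price", "Housing Sales Volume"]

# Assumed file and column names
df = pd.read_csv("economic_indicators.csv")
df["Date"] = pd.to_datetime(df["Year"].astype(str) + "-" + df["Month"].astype(str))
df = df.set_index("Date").sort_index()

# One line plot per indicator, sharing the time axis
fig, axes = plt.subplots(4, 2, figsize=(12, 12), sharex=True)
for ax, col in zip(axes.ravel(), indicators):
    ax.plot(df.index, df[col])
    ax.set_title(col)
plt.tight_layout()
plt.show()
```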
My analysis provided valuable insights into the long-term trends of these economic factors, which can be instrumental in making informed decisions related to economic planning, tourism strategies, and housing market assessments.
Nov 20, 2023
This dataset, called “Economic Indicators,” contains information about different economic factors, organized by year and month. Here’s a simple explanation of what each part means:
– Year and Month: When the data was collected.
– Logan Passengers: How many people used Logan Airport.
– Logan Intl Flights: Number of international flights at Logan Airport.
– Hotel Occupancy Rate: How full hotels were.
– Hotel Average Daily Rate: Average cost per day to stay in a hotel.
– Total Jobs: The total number of jobs available.
– Unemployment Rate: The percentage of people without jobs.
– Labor Force Participation Rate: Percentage of people working or looking for work.
– Pipeline Unit: Info about housing or building projects, like how many units there are.
– Pipeline Total Development Cost: How much it costs to build these projects.
– Pipeline Square Footage: The total size of these building projects.
– Pipeline Construction Jobs: How many jobs are created for building these projects.
– Foreclosure Petitions: How many foreclosure filings were started against homeowners.
– Foreclosure Deeds: How many foreclosures were completed, meaning people actually lost their homes.
– Median Housing Price: The middle price for homes.
– Housing Sales Volume: How many houses were sold.
– New Housing Construction Permits: Number of permissions given to build new houses.
– New Affordable Housing Permits: Number of permissions for building affordable houses.
This dataset gives a good overview of different parts of the economy like air travel, hotels, jobs, real estate, and the housing market. It helps understand the financial health and trends of a specific area.
Nov 17, 2023
Hi,
In today’s analysis, I compared time series models to help pick the best one for predicting future data. Different models work better for different kinds of data. Here’s a simple comparison:
1. ARIMA: Good for many kinds of data, but not for data with seasonal patterns (like sales that vary by season). Best for non-seasonal data that goes up and down.
2. SARIMA: Like ARIMA, but better for data with seasonal patterns (like higher ice cream sales in summer). It’s a bit complicated to use.
3. Exponential Smoothing: Easy to use and good for data that has trends and patterns that repeat every year. Not great if the data changes unexpectedly.
4. VAR (Vector Autoregression): Great for when you have several types of data and want to see how they affect each other. Needs all data types to be steady and can take a lot of computer power.
5. LSTM (Long Short Term Memory): Really good for big datasets and can understand complicated patterns. Needs a lot of data and computer power to work well.
6. Prophet (by Facebook): Made for business data that’s recorded every day. It’s good at handling special days like holidays. Not as good for data that’s not daily or very messy.
In the end, the best model often comes from trying a few and seeing which one predicts the best for your specific data. By Monday we will start analyzing our dataset.
Nov 15, 2023
Hello,
Today we searched the website “dataset.boston.gov”, and we are planning to use the economic indicators dataset for Boston.
The primary objective of this project is to delve into the intricacies of Boston’s economy, using the economic indicators dataset spanning from January 2013 to December 2019. The goal is to gain a thorough understanding of Boston’s economic pulse, essential for informed decision-making and strategic planning.
Nov 13, 2023
Hi,
In today’s class I learned about time series analysis. It’s commonly used to forecast future events based on past trends, identify patterns, and analyze the effects of certain decisions or events. There are several key components and methods in time series analysis that I learned about in today’s class, which are as follows:
- Trend Analysis: This involves identifying the underlying trend in the data, which could be increasing, decreasing, or constant over time.
- Seasonality: This refers to patterns that repeat at regular intervals, such as weekly, monthly, or yearly. Seasonality analysis helps in understanding and adjusting for these regular patterns.
- Stationarity: A time series is stationary if its statistical properties like mean, variance, and autocorrelation are constant over time. Many time series models require the data to be stationary.
- Models for Time Series Analysis: Common models include ARIMA (Autoregressive Integrated Moving Average), SARIMA (Seasonal ARIMA), and more advanced machine learning models like LSTM (Long Short-Term Memory) networks.
We also had a chance to look at the economic indicators data and learn practically about time series analysis. Before the next class, I will select a dataset from data.boston.gov to discuss.
Project Report-2
Report Making
Hello,
We have started working on the project report. As of today, we have finalized the issues, discussion, and results sections, and we plan to complete the remaining sections by tomorrow.
Analyzing Trends in Flee Statuses in Fatal Police Encounters
Hello!!
Today, I performed an analysis of flee statuses in fatal police encounters. I felt it was important to understand how people act in these situations, especially whether they try to run away.
Monthly Analysis of 2022: In 2022, the flee statuses in fatal police encounters showed varied patterns. The analysis categorizes flee statuses into four types: car, foot, not fleeing, and other. For example, in January, there were 5 incidents of fleeing by car, 19 by foot, 40 where the individual did not flee, and 1 ‘other’. Throughout the year, the ‘not fleeing’ category consistently had the highest number of incidents each month, with the numbers varying. The occurrences of fleeing by car and foot also showed fluctuations, while the ‘other’ category, though the least frequent, varied between 1 and 6 incidents per month. A detailed line chart provides a visual representation of these monthly trends, highlighting the fluctuating nature of flee statuses throughout the year.
Yearly Analysis: The yearly trend offers a broader perspective. A line chart plotting the annual data from 2015 onwards indicates that the majority of individuals in these fatal encounters did not attempt to flee, with this trend showing a slight decrease over the years. The incidents of fleeing by car and foot exhibit slight variations but remain relatively consistent annually. The ‘other’ category of fleeing remains the least common across the years.
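A minimal sketch of how these monthly and yearly counts could be tabulated and plotted; the file name and the `date`/`flee_status` column names follow the Washington Post release but are assumptions here.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Assumed file and column names
df = pd.read_csv("fatal-police-shootings-data.csv", parse_dates=["date"])

# Monthly counts for 2022, broken down by flee status
d2022 = df[df["date"].dt.year == 2022]
monthly = pd.crosstab(d2022["date"].dt.month, d2022["flee_status"])
monthly.plot(marker="o", title="Flee Status by Month, 2022")
plt.xlabel("Month")
plt.ylabel("Incidents")
plt.show()

# Yearly counts from 2015 onward
yearly = pd.crosstab(df["date"].dt.year, df["flee_status"])
yearly.plot(marker="o", title="Flee Status by Year")
plt.xlabel("Year")
plt.ylabel("Incidents")
plt.show()
```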
The analysis of flee statuses in fatal police encounters highlights crucial aspects of these incidents. While most individuals did not flee, those who did showed a preference for fleeing by car or on foot, with slight year-to-year variation. These insights are valuable for understanding the nature of fatal police encounters and can be used in the project. I have a few queries with respect to the findings, which I intend to ask the professor in the next class.
An Analytical view into Fatal Police Shootings by Agency
Hello,
Today I performed analysis on police shooting data to determine the frequency of fatal shootings involving specific police agencies.
A detailed analysis of the data from 2015 to the present provides us with a clearer picture of how these tragic events are distributed across different law enforcement agencies. For instance, the Los Angeles Police Department (LAPD) stands out with the highest number of fatal shootings, a statistic that prompts a deeper examination of the protocols and community engagements specific to the LAPD. The bar chart below shows the top 10 agencies by total fatal police shootings.
Another aspect of police shootings is the intersection with mental health. The analysis reveals significant variation in the proportion of incidents related to mental illness among different agencies. The Las Vegas Metropolitan Police Department, for example, shows a notably higher rate of mental illness-related incidents, which could reflect a broader narrative on the challenges police face when encountering individuals in mental health crises. The second graph is a bar chart that illustrates the top 10 police agencies by the percentage of mental illness-related shootings, complementing this discussion of the complexities of police interactions with individuals experiencing mental health crises.
For the final visualization, I created a scatter plot that examines the relationship between the use of body cameras and the total number of fatal police shootings by agency. This visualization highlights how the deployment of body cameras might influence police interactions and the transparency of such critical incidents.
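Below is a rough sketch of how the per-agency counts, mental-illness percentages, and body-camera scatter could be computed. It assumes a dataframe that already has one row per shooting with an `agency` name plus boolean `was_mental_illness_related` and `body_camera` columns; the file name and the minimum-incident threshold are also my assumptions.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Assumed: one row per shooting with an `agency` name column and
# boolean `was_mental_illness_related` / `body_camera` columns.
df = pd.read_csv("shootings_with_agency.csv")

# Top 10 agencies by total fatal shootings
df["agency"].value_counts().head(10).plot(
    kind="bar", title="Top 10 Agencies by Fatal Police Shootings")
plt.ylabel("Total shootings")
plt.tight_layout()
plt.show()

# Per-agency totals and rates, keeping agencies with enough incidents
per_agency = df.groupby("agency").agg(
    total=("agency", "size"),
    mental_illness_pct=("was_mental_illness_related", "mean"),
    body_camera_rate=("body_camera", "mean"),
)
per_agency = per_agency[per_agency["total"] >= 20]  # arbitrary cutoff

# Scatter: body-camera usage rate vs total fatal shootings
plt.scatter(per_agency["body_camera_rate"], per_agency["total"])
plt.xlabel("Share of incidents with a body camera")
plt.ylabel("Total fatal shootings")
plt.show()
```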
Together, the analysis and graphs provide a data-driven narrative about police practices and the factors influencing fatal police encounters.
I have a few queries with respect to the findings, which I intend to ask the professor in the next class.
Exploring Racial Differences in Fatal Police Shootings
Hello,
In today’s analysis, I produced correlation heat maps for different racial groups to gain insight into how various factors are interrelated in incidents of fatal police shootings, with a particular focus on how these relationships vary across groups. My observations are as follows (a code sketch of the heatmap workflow appears after the general observations below):
- Asian (A)
– There is a positive correlation between latitude and longitude, suggesting that incidents involving Asian individuals tend to occur in specific geographical regions.
– The age variable shows very little correlation with other variables, indicating that age does not play a significant role in these incidents for Asian individuals.
– The was_mental_illness_related variable shows a slight negative correlation with body_camera, suggesting that incidents involving mental illness are less likely to be recorded by body cameras.
- White (W)
– Similar to the Asian group, there is a positive correlation between latitude and longitude.
– The age variable shows a slight negative correlation with latitude and longitude, suggesting that incidents involving older white individuals might occur in different geographical areas compared to younger individuals.
– The was_mental_illness_related variable has very little correlation with other variables.
- Hispanic (H)
– Again, there is a positive correlation between latitude and longitude.
– The age variable shows a slight negative correlation with latitude, longitude, and was_mental_illness_related.
– The body_camera variable shows very little correlation with other variables.
- Black (B)
– The positive correlation between latitude and longitude is present, though slightly weaker compared to other racial groups.
– The age variable shows a slight negative correlation with latitude, longitude, and was_mental_illness_related.
– The body_camera variable shows a slight positive correlation with was_mental_illness_related.
- Other (O)
– The positive correlation between latitude and longitude is weaker compared to other racial groups.
– The age variable shows very little correlation with other variables.
– The was_mental_illness_related variable shows a slight negative correlation with body_camera.
- Native American (N)
– The correlation between latitude and longitude is weaker compared to other racial groups.
– The age variable shows very little correlation with other variables.
– The body_camera variable shows a slight negative correlation with was_mental_illness_related.
- Black and Hispanic (B;H)
– This group has a very limited number of data points, and as such, the correlations should be interpreted with caution.
– The age variable shows a slight negative correlation with latitude and longitude.
– The was_mental_illness_related variable shows a slight positive correlation with body_camera.
General Observations
– Across all racial groups, there is a consistent positive correlation between latitude and longitude.
– The age variable generally shows little to no correlation with other variables.
– The relationship between was_mental_illness_related and body_camera varies across racial groups, indicating potential areas for further investigation.
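As referenced above, here is a minimal sketch of how such per-group correlation heatmaps could be produced. The file name is assumed, and the boolean columns are cast to numbers so they enter the correlation; if they are stored as strings, an extra conversion step would be needed.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Assumed file name; column names follow the dataset fields discussed above
df = pd.read_csv("fatal-police-shootings-data.csv")
cols = ["latitude", "longitude", "age", "was_mental_illness_related", "body_camera"]

for race, group in df.groupby("race"):
    corr = group[cols].astype(float).corr()  # booleans become 0/1
    sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
    plt.title(f"Correlation heatmap, race = {race}")
    plt.tight_layout()
    plt.show()
```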
These observations provide a starting point for further analysis and discussion. It is important to approach these findings with a critical eye and consider additional factors and context that might influence these relationships.
Predicting Gender with Logistic Regression
In this analysis, I explored a dataset of fatal police shootings in the United States to predict the gender of individuals involved based on various features. I used a logistic regression model and describe the insights gained from the dataset below.
I focused on categorical variables such as threat_type, flee_status, armed_with, race, and gender. I divided the data into training and testing sets, with 80% of the data used for training and 20% for testing.
The model achieved an accuracy of approximately 95.19%, which is very good. I then created a bar chart to visualize the coefficients of the logistic regression model used in this analysis.
This visualization provides a clear view of how each feature influences the prediction of gender in the context of fatal police shootings. I also generated a confusion matrix to describe the performance of the classification model; it summarizes the number of correct and incorrect predictions, broken down by each class.
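A sketch of this modeling pipeline is given below. The one-hot encoding, the `dropna` step, and the file name are assumptions, so the exact accuracy may differ from the 95.19% reported above.

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix

# Assumed file name; categorical predictors are one-hot encoded
df = pd.read_csv("fatal-police-shootings-data.csv")
features = ["threat_type", "flee_status", "armed_with", "race"]
data = df[features + ["gender"]].dropna()

X = pd.get_dummies(data[features])
y = data["gender"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))

# Bar chart of the fitted coefficients, one per one-hot encoded feature
pd.Series(model.coef_[0], index=X.columns).sort_values().plot(
    kind="barh", figsize=(6, 10))
plt.tight_layout()
plt.show()
```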
My analysis of the matrix is as follows:
- Top-left cell (True Negative): The number of actual females correctly predicted as females. In this case, the count is 0, indicating that the model failed to correctly predict any of the female samples.
- Top-right cell (False Positive): The number of actual females incorrectly predicted as males. The count is 61, showing that all the females in the test set were misclassified as males.
- Bottom-left cell (False Negative): The number of actual males incorrectly predicted as females. The count is 0, indicating that there were no males misclassified as females.
- Bottom-right cell (True Positive): The number of actual males correctly predicted as males. The count is 1207, showing that the model was very effective at identifying the male samples.
- In the end, I felt that the model is biased toward predicting males compared with females.
I am planning to raise my questions and concerns with the professor/TA in the next class.
A Deep Dive into California, Texas, and Florida
Hi,
In this analysis, I have focused on three major states in the U.S.: California (CA), Texas (TX), and Florida (FL). I applied a technique called K-Means clustering to the data, which helped us group the incidents into four distinct spatial clusters for each state. This approach allows us to see areas with higher concentrations of these unfortunate events.
I observed the following:-
- Each state has its own unique pattern. While some clusters are densely packed in urban areas, others spread out in more rural regions.
- The scatter plots reveal that incidents are not evenly distributed but rather concentrate in certain areas.
Breaking down the clusters by state:
1. California (CA):
Cluster 0: This is the most significant cluster with 362 incidents. It mainly captures the dense urban areas of the state.
Cluster 1: With 145 incidents, this cluster represents a mix of urban and suburban areas.
Cluster 2: This is one of the smaller clusters with 89 incidents, indicating less frequent occurrences in these regions.
Cluster 3: Comprising 239 incidents, this cluster spans several urban zones.
2. Texas (TX):
Cluster 0: The largest cluster with 266 incidents, capturing major cities and their surroundings.
Cluster 1: This cluster represents 115 incidents, predominantly in the eastern part of the state.
Cluster 2: With 179 incidents, this cluster spreads across the central regions.
Cluster 3: This is the smallest cluster in Texas with 67 incidents.
3. Florida (FL):
Cluster 0: Representing 104 incidents, this cluster is situated around the northern part of the state.
Cluster 1: The largest in Florida, this cluster with 249 incidents covers the southern tip, including areas around Miami.
Cluster 2: This is the smallest cluster with only 17 incidents.
Cluster 3: With 89 incidents, it captures the central regions of the state.
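A minimal sketch of the per-state spatial clustering described above, assuming `state`, `latitude`, and `longitude` columns and the file name shown:

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Assumed file and column names
df = pd.read_csv("fatal-police-shootings-data.csv")

for state in ["CA", "TX", "FL"]:
    coords = df.loc[df["state"] == state, ["longitude", "latitude"]].dropna()
    labels = KMeans(n_clusters=4, random_state=42, n_init=10).fit_predict(coords)
    # Number of incidents assigned to each cluster
    print(state, pd.Series(labels).value_counts().sort_index().to_dict())
    plt.scatter(coords["longitude"], coords["latitude"], c=labels, s=10)
    plt.title(f"K-Means spatial clusters (k = 4) for {state}")
    plt.xlabel("Longitude")
    plt.ylabel("Latitude")
    plt.show()
```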
I intend to discuss my concerns and questions with the professor in the next class.
EDA on Armed Status
Hi, I analyzed the distribution of incidents across different armed statuses.
I observed that the majority of fatal police shootings involve individuals armed with guns, with the number significantly higher than any other category. Knives are the second most common weapon involved in fatal police shootings.
Then to gain a better understanding, I plotted a graph between armed status and threat type for the top armed statuses.
I found that people with guns mainly “shoot” or “point” them. Those with knives often “threaten” or “attack” without actively using the knife like a gun. Vehicles are mostly used to “attack”, implying they’re used as weapons. Unarmed individuals often show “attack” or “move” behaviors, suggesting their actions, not weapons, are seen as threats.
Then I analyzed the Relationship between Armed Status and Flee Status.
My observations were as follows: most people in the incidents didn’t try to flee. People with guns often fled by car or on foot. Those with knives mainly stayed, but if they fled, it was on foot. Many unarmed individuals tried to escape, either on foot or by car. Those using vehicles as weapons often used them to flee as well.
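The cross-tabulations behind these observations could be produced roughly as follows; the file and column names are assumptions.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Assumed file and column names
df = pd.read_csv("fatal-police-shootings-data.csv")

# Distribution of armed statuses
df["armed_with"].value_counts().head(10).plot(
    kind="bar", title="Incidents by Armed Status")
plt.ylabel("Incidents")
plt.tight_layout()
plt.show()

# Armed status vs threat type for the most common armed statuses
top_armed = df["armed_with"].value_counts().head(5).index
subset = df[df["armed_with"].isin(top_armed)]
pd.crosstab(subset["armed_with"], subset["threat_type"]).plot(kind="bar", stacked=True)
plt.ylabel("Incidents")
plt.tight_layout()
plt.show()

# Armed status vs flee status
pd.crosstab(subset["armed_with"], subset["flee_status"]).plot(kind="bar", stacked=True)
plt.ylabel("Incidents")
plt.tight_layout()
plt.show()
```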
Next, I am planning to look for correlations between the columns and to develop a strategy for filling in the missing values. I’m also planning to share my observations and issues with the professor during our upcoming class.
EDA on Age Distributions and Proportions
In today’s analysis, I looked into other variables, such as the distribution by gender and race, and factors like whether the individual was armed, was fleeing, or whether the incident was mental illness-related.
While analyzing the pie chart, I observed that the 21-30 and 31-40 age groups account for the largest proportions of individuals killed by the police.
Then I explored the gender distribution by age group.
I observed that the number of males killed by the police is significantly higher than females across all age groups. The highest count of males is observed in the age group 31-40, followed closely by the 21-30 age group.
While analyzing it in percentage terms, I noticed that across all age groups the male percentage is overwhelmingly higher than the female percentage. The 0-10 age group is the only one where the gender distribution is balanced, with males and females each at 50%. The 61-70 age group has the highest male percentage (97.23%), while the female percentage is the lowest in this group. The female percentage is slightly higher in the older age groups (71-80 and 81+) than in the middle age groups, but it is still significantly lower than the male percentage.
This analysis confirms that a significant majority of individuals killed by the police across all age groups are male.
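A sketch of the age-group binning and gender breakdown, with the bin edges and file name assumed to match the groups named above:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Assumed file name; bin edges chosen to match the age groups discussed above
df = pd.read_csv("fatal-police-shootings-data.csv")
bins = [0, 10, 20, 30, 40, 50, 60, 70, 80, 120]
labels = ["0-10", "11-20", "21-30", "31-40", "41-50", "51-60", "61-70", "71-80", "81+"]
df["age_group"] = pd.cut(df["age"], bins=bins, labels=labels)

# Proportion of incidents per age group
df["age_group"].value_counts().sort_index().plot(kind="pie", autopct="%1.1f%%")
plt.ylabel("")
plt.show()

# Gender counts and row-wise percentages within each age group
counts = pd.crosstab(df["age_group"], df["gender"])
print(counts)
print((counts.div(counts.sum(axis=1), axis=0) * 100).round(2))
counts.plot(kind="bar", title="Gender Distribution by Age Group")
plt.ylabel("Incidents")
plt.tight_layout()
plt.show()
```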
Further, I plan to analyze the distributions of race, armed status, flee status, and mental illness relation by age group.
Geopy Data Visualization
After a comprehensive exploratory data analysis, I decided to create an interactive map using the location data points given in the dataset, i.e., latitude and longitude.
After closely analyzing the map, I discovered the following points:
- High Concentration Areas: The map shows a higher concentration of police shootings in certain urban areas. This could be due to higher population densities or higher crime rates in these regions.
- Rural areas and certain states seem to have fewer incidents of police shootings. Factors could include lower population densities or fewer incidents requiring police intervention.
- Some states seem to have a higher number of shootings relative to their size and population. A deeper analysis comparing the number of shootings with the state’s population could provide insights into states with high or low numbers of incidents.
- Major cities seem to have a higher number of shooting incidents. This correlation might be due to a combination of factors, including higher population density, increased police presence, or socio-economic factors.
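One way to build such an interactive map is sketched below using `folium` (the post’s title mentions Geopy; folium is used here purely as an illustration, and the file name is assumed).

```python
import pandas as pd
import folium

# Assumed file and column names
df = pd.read_csv("fatal-police-shootings-data.csv").dropna(subset=["latitude", "longitude"])

# Roughly centered on the contiguous United States
m = folium.Map(location=[39.8, -98.6], zoom_start=4)
for _, row in df.iterrows():
    folium.CircleMarker(
        location=[row["latitude"], row["longitude"]],
        radius=1,
        color="crimson",
        fill=True,
    ).add_to(m)
m.save("shootings_map.html")  # open in a browser to explore the map
```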
Further Analysis:
As advised by the professor, I will download the data from the census website and merge the two sheets at the county level, which I feel would be more appropriate for in-depth analysis.
EDA on Dataset
Today I performed an in-depth EDA on the Washington Post shooting dataset, which contains 8,770 records with 19 columns.
The dataset contains a lot of missing data in several variables, particularly in “County” (55.38%) and demographic-related variables like “Race” (16.18%). Other variables such as “Flee Status” and the location-related “Longitude” and “Latitude” also exhibit missing entries, which will affect geographical analyses.
I then moved on to demographic analysis. The individuals involved in these incidents were predominantly male and primarily between the ages of 20 and 40, with a slight skew towards younger ages. Regarding race, White individuals were most frequently involved, followed by Black and Hispanic individuals.
For the geographic analysis, I plotted a bar chart of the number of incidents in each state and observed that the incidents are not uniformly distributed across states.
California (1235 incidents), Texas (807 incidents), and Florida (559 incidents) have notably higher incidents compared to other states.
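The missing-value percentages and per-state counts above could be reproduced with a short snippet like this (file name assumed):

```python
import pandas as pd

# Assumed file name
df = pd.read_csv("fatal-police-shootings-data.csv")
print(df.shape)  # expected shape: (8770, 19)

# Percentage of missing values per column, highest first
print((df.isna().mean() * 100).sort_values(ascending=False).round(2))

# Incidents per state, highest first
print(df["state"].value_counts().head(10))
```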
I am planning to conduct further detailed analysis and identify any correlations between these variables. I am also planning to ask my questions to the professor in the upcoming class.
October 11, 2023
Today we started Project 2, based on the Washington Post data repository on fatal police shootings in the United States. The data has records starting from January 2, 2015 and is updated weekly. While doing some basic analysis, I discovered that there are many missing values in some parameters, such as flee, age, and race. I am still undecided on how to address these missing values and proceed with the project. In addition, the choice of model will depend on our end goal and on our findings.
In the coming days, I will do a comprehensive analysis to find relationships between the parameters and get a better picture.
Project Report -1
October 2, 2023
In the current analysis, I attempted to construct a model based on equation (i), aiming to predict inactivity and obesity using various parameters such as physical environment, transport, economics, and food access. However, I am encountering an issue where the R² values are lower than the R² value of that equation.
Moreover, there are ongoing issues with the code that I am in the process of debugging. On a separate note, since this is an experimental project, I plan to employ Weighted Least Squares (WLS) to finalize my project and explore whether I can enhance the model’s accuracy to successfully complete the project.
Sep 29, 2023
We aim to assess the factors influencing the percentage of diabetics using the model:
Y = β0 + β1X1 + β2X2
- Y denotes the percentage of individuals with diabetes.
- X1 represents the percentage of people who are inactive.
- X2 signifies the percentage of obesity.
We’re relying on data from the CDC website, which offers insights into the Social Determinants of Health. These determinants can act as indicators for diabetes risk factors. Specifically, we’re focusing on four variables: physical environment, transport, economics, and food access.
It’s evident that these variables are interrelated. For instance:
- Inactivity correlates with both the physical environment and transportation.
- Obesity is influenced by economic conditions and food accessibility.
Considering these relationships, we can outline the following equations:
- For diabetes: Y = β0 + X1β1 (for inactivity) + X2β2 (for obesity).
- For inactivity: X1 = β10 + X11β11 (for physical environment) + X12β12 (for transport).
- For obesity: X2 = β20 + X21β21 (for economics) + X22β22 (for food access).
To optimize our analysis, we’ll structure it into three models. While the first model has been developed, the other two will be constructed to support the primary model. Afterward, we’ll compare the results from all three models.
27 September, 2023
In the previous analysis, I conducted 5-fold cross-validation with R² scores ranging from -0.0598 to 0.4617.
I conducted 10-fold cross-validation to check how my model would perform, and whether the model efficiency would increase or decrease. The results are as follows:
5-Fold Cross-Validation:
- R² values: [0.462, 0.020, -0.060, -0.059, 0.411]
- Mean Absolute Error (MAE) values: [-0.588, -0.509, -0.460, -0.164, -0.368]
- Root Mean Squared Error (RMSE) values: [0.773, 0.706, 0.610, 0.234, 0.593]
10-Fold Cross-Validation:
- R² values: [0.402, 0.348, 0.338, -0.337, -0.077, 0.268, 0.024, -0.118, 0.060, 0.423]
- Mean Absolute Error (MAE) values: [-0.566, -0.567, -0.417, -0.617, -0.621, -0.259, -0.150, -0.177, -0.159, -0.587]
- Root Mean Squared Error (RMSE) values: [0.766, 0.739, 0.553, 0.848, 0.761, 0.344, 0.198, 0.266, 0.235, 0.809]
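Both cross-validations can be run with scikit-learn’s `cross_validate`. The sketch below assumes the Excel file name and the “% INACTIVE”, “% OBESE”, and “% DIABETIC” column names; the error scorers return negated values (which is why the MAE lists above carry negative signs), and they are flipped back to positive here.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate

# Assumed file and column names
df = pd.read_excel("cdc_diabetes_data.xlsx")
X = df[["% INACTIVE", "% OBESE"]]
y = df["% DIABETIC"]

scoring = ["r2", "neg_mean_absolute_error", "neg_root_mean_squared_error"]
for k in (5, 10):
    results = cross_validate(LinearRegression(), X, y, cv=k, scoring=scoring)
    print(f"{k}-fold R²:  ", results["test_r2"].round(3))
    print(f"{k}-fold MAE: ", (-results["test_neg_mean_absolute_error"]).round(3))
    print(f"{k}-fold RMSE:", (-results["test_neg_root_mean_squared_error"]).round(3))
```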
My interpretation of the 5- and 10-fold cross-validation results is as follows:
- R² (Coefficient of Determination): This measures how much of the variation in the target variable is explained by the model. A higher R² is generally better. In our results, both 5-fold and 10-fold cross-validation produced some negative R² values, which indicates that on those folds the model performed worse than simply predicting the mean. The 10-fold results seem to have slightly more consistent R² values, but it is essential to ensure that the model doesn’t overfit.
- Mean Absolute Error (MAE): It measures the average of the absolute differences between the predicted and actual values. A lower MAE indicates better model performance. The MAEs from 10-fold CV are slightly more consistent than 5-fold.
- Root Mean Squared Error (RMSE): It measures the square root of the average of the squared differences between the predicted and actual values. A lower RMSE indicates better model performance. The RMSE values from 10-fold CV are relatively consistent.
I believe the 10-fold CV provides more consistent results in terms of R², MAE, and RMSE. In addition, the choice between 5-fold and 10-fold (or any other k) often depends on specific project needs and dataset size.
Cross Validation, Sep 25, 2023
What is Cross Validation?
Cross-validation is a resampling method that uses different portions of the data to test and train a model on different iterations.
I applied cross-validation in our project using Python. Here is a detailed analysis based on the 5-fold cross-validation results:
- Variability in R² Values:
- The R² scores for the 5-fold cross-validation range from -0.0598 to 0.4617.
- Only two of the folds resulted in an R² value above 0.4, which is a moderate explanatory power. The other three folds had values close to zero or slightly negative.
- Negative R² values in two of the folds indicate that the model’s predictions were worse than just predicting the mean of the target variable for those particular data splits.
Mean Absolute Error (MAE):
- The MAE values range from 0.1641 to 0.5881 (ignoring the negative sign, which is due to the scoring convention).
This means that, on average, the model’s predictions can deviate from the actual values by this amount. The model seems to have a higher error in some folds compared to others.
Mean Squared Error (MSE) and Root Mean Squared Error (RMSE):
- The RMSE values for the 5-fold cross-validation range from 0.2342 to 0.7728.
- The RMSE is particularly useful because it gives an idea of the size of the error in the same units as the target variable. An RMSE of 0.7728 means that the model’s predictions can be off by about 0.7728% (in terms of diabetic percentage) on average, in the worst-performing fold.
The variability in performance across the 5 folds suggests that the dataset might contain regions where the linear relationship between the features and the target variable isn’t strong. The presence of negative R² values in two of the folds indicates regions where the linear model doesn’t fit the data well.
Friday, Sep 22, 2023
Update regarding the project:
To begin with, I fixed the issues in the linear regression code. The statistical parameters and the graph are as follows:
Linear regression graph:
The visualizations show the relationship between the independent variables (“% INACTIVE” and “% OBESE”) and the dependent variable (“% DIABETIC”) for the test data. In each plot:
- The blue points represent the actual “% DIABETIC” values.
- The red points represent the predicted “% DIABETIC” values based on the linear regression model.
Key Metrics for the Model:
- Mean Squared Error (MSE): 0.400063 (the raw output showed -0.400063; the negative sign is an artifact of the scoring convention, since MSE itself cannot be negative)
This represents the average of the squares of the errors between the predicted and actual values. Lower values are better, but the scale depends on the dependent variable.
- R-squared (R²): 0.395
This represents the proportion of the variance in the dependent variable that is explained by the independent variables in the model. The R² value ranges from 0 to 1, with higher values indicating a better fit. An R² value of 0.395 means that the model explains approximately 39.5% of the variability in “% DIABETIC”.
Interpretation:
- The R² value of 0.395 suggests that the model explains about 39.5% of the variance in the “% DIABETIC” variable, which is a moderate level of explanation.
- The MSE of 0.400 is a measure of the model’s prediction error. Lower values are generally better.
- The model efficiency is 39.5%, which I feel is not great, but this is what could be achieved with the available data points.
- To increase the model efficiency, we can try WLS, but I am still not sure how to implement it. I am going to ask in Monday’s class.
I would be trying to find a relationship with other parameters which are available on the website. I have considered Housing cost burden as a parameter to experiment with obesity.
Weighted Least Squares
T-test:
In simple terms, a t-test is a statistical test that is used to compare the means of two groups.
The t-test is not applicable to our Project 1, since it compares only two groups, while our project involves three variables.
What is WLS?
With WLS, data points are given different weights, with the objective of improving the model-fitting process by giving greater weight to more reliable observations. Applying WLS refines the regression model by taking the weighted significance of each data point into account; the fit is adjusted according to the weight assigned to each observation, which can improve accuracy.
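A minimal sketch of how WLS could be fitted with statsmodels is shown below. The file and column names are assumed, and weighting by the inverse squared OLS residuals is just one simple heuristic for choosing weights, not necessarily the right scheme for this project.

```python
import pandas as pd
import statsmodels.api as sm

# Assumed file and column names (matching the regression described earlier)
df = pd.read_excel("cdc_diabetes_data.xlsx")
X = sm.add_constant(df[["% INACTIVE", "% OBESE"]])
y = df["% DIABETIC"]

# Fit OLS first, then derive weights: observations with larger residuals
# get smaller weight (a crude heuristic; a variance model would be better).
ols = sm.OLS(y, X).fit()
weights = 1.0 / (ols.resid ** 2 + 1e-6)

wls = sm.WLS(y, X, weights=weights).fit()
print(wls.summary())
```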
During the upcoming class, I plan to engage with the professor and TA to seek advice on how to implement WLS in the project.
September 18, 2023
Today, I learned about regression analysis involving two variables.
The multiple regression formula is represented as
Y = β0 + β1X1 + β2X2…
In this context, Y represents the percentage of diabetics, X1 stands for the percentage of inactivity, and X2 indicates the percentage of obesity.
How did I use Linear regression in the project?
I started by importing data from an Excel file; using the pandas library in Visual Studio, I was able to efficiently process and analyze the data.
I then used linear regression to see the potential relationship between %diabetes and %obesity, as well as between %diabetes and %inactivity. I also generated smooth histograms for both %diabetes and %obesity. These were accompanied by some key statistical metrics, such as mean, median, skewness, and kurtosis, offering a deeper understanding of the data’s distribution and features.
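As a sketch of this two-predictor regression (with the Excel file name assumed and the columns renamed to be formula-friendly), the statsmodels summary also reports the coefficients, R², and p-values discussed elsewhere in this report:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Assumed file name; columns renamed so they can be used in a formula
df = pd.read_excel("cdc_diabetes_data.xlsx").rename(columns={
    "% DIABETIC": "diabetic", "% INACTIVE": "inactive", "% OBESE": "obese"})

# Y = β0 + β1·X1 + β2·X2, with X1 = % inactivity and X2 = % obesity
model = smf.ols("diabetic ~ inactive + obese", data=df).fit()
print(model.summary())  # coefficients, R², and p-values for each term
```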
September 15, 2023
The simple linear regression plot failed to reveal many details regarding the connection between obesity and inactivity.
As I understand it, the aim is to figure out whether diabetes is affected by obesity and inactivity. For the linear regression, I used two independent variables: obesity and inactivity. While plotting the graph, I found that there are many outliers and could not find much of a relationship.
I am still debugging the errors and still have questions about what exactly we need to find from the dataset.
Importance of P Value in Stats
P-values are a widely used concept in statistics and scientific studies; however, for those who are completely new to statistical methods, they can be very puzzling.
So, what are p values?
A p-value, often known as a “probability value,” is a numerical measure of how likely it is that a result at least as extreme as the one observed would occur by chance alone.
So, why is P-value important?
It helps us figure out whether the patterns in the data most likely reflect a real effect rather than chance. A low p-value, often less than 0.05, suggests that our findings are unlikely to be due to randomness. This gives us more confidence that we are on to something important.
In conclusion, what I understood is that if the p-value is smaller than the chosen significance level (for example, 0.05), the null hypothesis will be rejected.
Since I will be using linear regression in the project, the p-value plays an important role. I will need to calculate p-values from the dataset for various parameters such as inactivity and obesity, but I am not sure how I am going to implement this. I will ask the TA or the professor in Friday’s doubt session.
September 11, 2023
While the data is structured and lacks duplicate values, there are numerous other factors to consider when examining variables related to conditions like diabetes and obesity. For instance, if the weather in a particular county or state is excessively cold compared to others, people may increase their food consumption for survival. Furthermore, if a county is located within a state where fast food consumption is prevalent, the likelihood of individuals experiencing inactivity, obesity, and diabetes is significantly higher.
My strategy for tackling this project involves several steps. First, I plan to divide the counties based on their respective states and categorize them into either northern or southern regions. This division will provide valuable insights into why certain states or counties exhibit higher rates of diabetes, obesity, or inactivity.
My initial focus will be on inactivity and obesity, as I believe that inactivity often leads to obesity, which in turn can increase the risk of diabetes. To facilitate this analysis, I have organized the counties using the Federal Information Processing Standard (FIPS) codes, making it easier to group and study the data. Additionally, I have used Python to compute various statistical parameters such as mean, median, and standard deviation to gain a deeper understanding of the data’s characteristics.
In conclusion, I plan to collaborate with Dr. Dylan George to determine the specific findings he requires from our data analysis. However, I have some uncertainty regarding the application of Heteroscedasticity using Python, and I intend to seek clarification from my instructors during class discussions.