Sep 29, 2023

We aim to assess the factors influencing the percentage of diabetics using the model:

Y = β0 + β1X1 + β2X2

  • Y denotes the percentage of individuals with diabetes.
  • X1 represents the percentage of people who are inactive.
  • X2 signifies the percentage of obesity.

We’re relying on data from the CDC website, which offers insights into the Social Determinants of Health. These determinants can act as indicators for diabetes risk factors. Specifically, we’re focusing on four variables: physical environment, transport, economics, and food access.

It’s evident that these variables are interrelated. For instance:

  1. Inactivity correlates with both the physical environment and transportation.
  2. Obesity is influenced by economic conditions and food accessibility.

Considering these relationships, we can outline the following equations:

  1. For diabetes: y = X1β1 (for inactivity) + X2β2 (for obesity).
  2. For inactivity: X1 = X11β11 (for physical environment) + X12β12 (for transport).
  3. For obesity: X2 = X21β21 (for economics) + X22β22 (for food access).

To optimize our analysis, we’ll structure it into three models. While the first model has been developed, the other two will be constructed to support the primary model. Afterward, we’ll compare the results from all three models.
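As a rough illustration of the three-model structure, here is a minimal sketch using NumPy least squares. All data below are synthetic, and the variable names, sample size, and coefficients are assumptions made for the example, not values from the CDC dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200  # hypothetical number of counties

# Synthetic stand-ins for the four social-determinant variables (not CDC data).
phys_env, transport, economics, food_access = rng.normal(size=(4, n))

# Synthetic outcomes generated with made-up coefficients.
inactivity = 15 + 2.0 * phys_env + 1.0 * transport + rng.normal(scale=0.5, size=n)
obesity = 18 + 1.5 * economics + 0.5 * food_access + rng.normal(scale=0.5, size=n)
diabetes = 1 + 0.2 * inactivity + 0.1 * obesity + rng.normal(scale=0.3, size=n)


def fit_ols(y, *predictors):
    """Return [intercept, slope_1, ..., slope_k] from ordinary least squares."""
    X = np.column_stack([np.ones(len(y)), *predictors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


# Primary model plus the two supporting models.
model_diabetes = fit_ols(diabetes, inactivity, obesity)
model_inactivity = fit_ols(inactivity, phys_env, transport)
model_obesity = fit_ols(obesity, economics, food_access)

for name, coef in [("diabetes", model_diabetes),
                   ("inactivity", model_inactivity),
                   ("obesity", model_obesity)]:
    print(name, np.round(coef, 3))
```

Each model is fitted separately here; comparing their coefficients and fit quality is then a matter of applying the same metrics to all three.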

27 September, 2023

In the previous analysis, I conducted 5-fold cross-validation, with R2 scores ranging from -0.0598 to 0.4617.

I then ran 10-fold cross-validation to check how the model performs and whether its efficiency increases or decreases. The results are as follows:

5-Fold Cross-Validation:

  • R2 values: [0.462, 0.020, -0.060, -0.059, 0.411]
  • Mean Absolute Error (MAE) values: [-0.588, -0.509, -0.460, -0.164, -0.368]
  • Root Mean Squared Error (RMSE) values: [0.773, 0.706, 0.610, 0.234, 0.593]

10-Fold Cross-Validation:

  • R2 values: [0.402, 0.348, 0.338, -0.337, -0.077, 0.268, 0.024, -0.118, 0.060, 0.423]
  • Mean Absolute Error (MAE) values: [-0.566, -0.567, -0.417, -0.617, -0.621, -0.259, -0.150, -0.177, -0.159, -0.587]
  • Root Mean Squared Error (RMSE) values: [0.766, 0.739, 0.553, 0.848, 0.761, 0.344, 0.198, 0.266, 0.235, 0.809]

My interpretation of the 5-fold and 10-fold cross-validation results is as follows:

  • R2 (Coefficient of Determination): It measures how much of the variation in the actual values is explained by the model’s predictions. A higher R2 is generally better. In our results, both 5-fold and 10-fold cross-validation have some negative R2 values, which means that on those folds the model did worse than simply predicting the mean. The 10-fold R2 values seem slightly more consistent to me, although they also include the lowest single score (-0.337), and it’s essential to ensure that the model doesn’t overfit.
  • Mean Absolute Error (MAE): It measures the average of the absolute differences between the predicted and actual values. A lower MAE indicates better model performance. The values listed above are negative only because of the scoring convention; their magnitudes are the errors. The MAEs from 10-fold CV seem slightly more consistent than those from 5-fold.
  • Root Mean Squared Error (RMSE): It measures the square root of the average of the squared differences between the predicted and actual values. A lower RMSE indicates better model performance. The RMSE values from 10-fold CV are relatively consistent.

I believe the 10-fold CV provides more consistent results in terms of R2, MAE, and RMSE, although the fold-to-fold spread is large in both cases. In addition, the choice between 5-fold and 10-fold (or any other k) is often based on specific project needs and dataset size.
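One way to put a number on "consistency" is to compare the mean, standard deviation, and range of the fold scores. The sketch below uses only the scores reported above (MAE entered as magnitudes); the choice of population standard deviation as the spread measure is my assumption, and other measures would work too.

```python
from statistics import mean, pstdev

# Fold scores copied from the results above (MAE shown as magnitudes).
scores = {
    "R2, 5-fold": [0.462, 0.020, -0.060, -0.059, 0.411],
    "R2, 10-fold": [0.402, 0.348, 0.338, -0.337, -0.077,
                    0.268, 0.024, -0.118, 0.060, 0.423],
    "MAE, 5-fold": [0.588, 0.509, 0.460, 0.164, 0.368],
    "MAE, 10-fold": [0.566, 0.567, 0.417, 0.617, 0.621,
                     0.259, 0.150, 0.177, 0.159, 0.587],
    "RMSE, 5-fold": [0.773, 0.706, 0.610, 0.234, 0.593],
    "RMSE, 10-fold": [0.766, 0.739, 0.553, 0.848, 0.761,
                      0.344, 0.198, 0.266, 0.235, 0.809],
}

# (mean, standard deviation, minimum, maximum) for each set of fold scores.
summary = {name: (mean(v), pstdev(v), min(v), max(v)) for name, v in scores.items()}

for name, (m, s, lo, hi) in summary.items():
    print(f"{name:14s} mean={m:.3f} std={s:.3f} range=[{lo:.3f}, {hi:.3f}]")
```

A smaller standard deviation or a narrower range for a given metric indicates more consistent performance across folds.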

Cross Validation, Sep 25-2023

What is Cross Validation?

Cross-validation is a resampling method that uses different portions of the data to test and train a model on different iterations.

I applied cross-validation in our project using Python.
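A minimal sketch of how such a run might look with scikit-learn is shown here. The data are synthetic stand-ins for the two predictors and the target, and the use of `KFold` with `shuffle=True` is my assumption rather than the project's actual setup (by default `KFold` does not shuffle, so folds follow the row order of the file).

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_validate

rng = np.random.default_rng(42)
n = 300
X = rng.normal(size=(n, 2))  # stand-ins for % INACTIVE and % OBESE
y = 7.0 + 0.6 * X[:, 0] + 0.4 * X[:, 1] + rng.normal(scale=0.7, size=n)

# scikit-learn reports error metrics negated so that "higher is better".
scoring = {
    "r2": "r2",
    "mae": "neg_mean_absolute_error",
    "rmse": "neg_root_mean_squared_error",
}

results = {}
for k in (5, 10):
    cv = KFold(n_splits=k, shuffle=True, random_state=0)
    scores = cross_validate(LinearRegression(), X, y, cv=cv, scoring=scoring)
    results[k] = {
        "r2": scores["test_r2"],
        "mae": -scores["test_mae"],    # flip the sign back to a positive error
        "rmse": -scores["test_rmse"],
    }
    print(f"{k}-fold R2:", np.round(results[k]["r2"], 3))
```

The negated scorers are the reason MAE values can appear with a minus sign in the raw output.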

Detailed analysis based on the 5-fold cross-validation results

Variability in R2 values:

  • The R2 scores for the 5-fold cross-validation range from -0.0598 to 0.4617.
  • Only two of the folds resulted in an R2 value above 0.4, which is a moderate explanatory power. The other three folds had values close to zero or slightly negative.
  • Negative R2 values in two of the folds indicate that the model’s predictions were worse than just predicting the mean of the target variable for those particular data splits.

Mean Absolute Error (MAE):

  • The MAE values range from 0.1641 to 0.5881 (ignoring the negative sign, which is due to the scoring convention).

This means that, on average, the model’s predictions can deviate from the actual values by this amount. The model seems to have a higher error in some folds compared to others.

 

Mean Squared Error (MSE) and Root Mean Squared Error (RMSE):

    • The RMSE values for the 5-fold cross-validation range from 0.2342 to 0.7728.
    • The RMSE is particularly useful because it gives an idea of the size of the error in the same units as the target variable. An RMSE of 0.7728 means that, in the worst-performing fold, the model’s predictions are typically off by about 0.77 percentage points of “% DIABETIC”.

The variability in performance across the 5 folds suggests that the dataset might contain regions where the linear relationship between the features and the target variable isn’t strong. The presence of negative R2 values in two of the folds indicates regions where the linear model doesn’t fit the data well.

 

Friday, Sep 22, 2023

Update regarding the project:

To begin with, I fixed the issues in the linear regression code. The statistical parameters and the graph are as follows:

Linear regression graph:

The visualizations show the relationship between the independent variables (“% INACTIVE” and “% OBESE”) and the dependent variable (“% DIABETIC”) for the test data. In each plot:

  • The blue points represent the actual “% DIABETIC” values.
  • The red points represent the predicted “% DIABETIC” values based on the linear regression model.

 

Key Metrics for the Model:

  1. Mean Squared Error (MSE). Value: 0.400063

This represents the average of the squares of the errors between the predicted and actual values. Lower values are better, but the scale depends on the dependent variable.

  2. R-squared. Value: 0.395

This represents the proportion of the variance in the dependent variable that’s explained by the independent variables in the model. On the data used to fit the model, the \( R^2 \) value ranges from 0 to 1, with higher values indicating a better fit; on held-out data it can be negative. An \( R^2 \) value of 0.395 means that the model explains approximately 39.5% of the variability in “% DIABETIC”.
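Both metrics can be checked by hand. The sketch below computes MSE and R-squared from their definitions on a few made-up values (not project data); it also shows that R-squared drops below zero when the predictions are worse than simply predicting the mean.

```python
def mse(actual, predicted):
    """Average of the squared prediction errors."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def r_squared(actual, predicted):
    """1 - (residual sum of squares) / (total sum of squares around the mean)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


actual = [6.0, 7.0, 8.0, 9.0]   # made-up "% DIABETIC" values
good = [6.5, 7.0, 8.0, 8.5]     # predictions close to the actual values
bad = [9.0, 6.0, 9.0, 6.0]      # predictions worse than predicting the mean

print("good:", mse(actual, good), r_squared(actual, good))
print("bad: ", mse(actual, bad), r_squared(actual, bad))
```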

Interpretation:

  • The \( R^2 \) value of 0.395 suggests that the model explains about 39.5% of the variance in the “% DIABETIC” variable, which is a moderate level of explanation.
  • The MSE of 0.400 is a measure of the model’s prediction error. Lower values are generally better.
  • The model efficiency (\( R^2 \)) is 39.5%, which I feel is not great, but this is what could be achieved with the available data points.
  • To increase the model efficiency, we could use WLS, but I am still not sure how to implement it. I am going to ask in Monday’s class.

I will try to find relationships with other parameters that are available on the website. I have considered housing cost burden as a parameter to experiment with for obesity.

Weighted Least Squares

T-test:

In simple terms, the t-test is a statistical test that is used to compare the means of two groups.

The t-test is not applicable in our Project 1: it is designed for comparing the means of two groups, whereas our project involves three variables.

What is WLS?

With WLS, each data point is assigned a weight, with the objective of improving the model fit by giving greater weight to more reliable observations. Applying WLS adjusts the regression model by taking the weighted significance of each data point into consideration, so each observation influences the fitted model in proportion to its assigned weight.
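A minimal sketch of the idea with NumPy follows, assuming the noise standard deviation of each observation is known and using inverse-variance weights. The synthetic data, the noise model, and the choice of weights are all assumptions for illustration; in a real project the weights would have to be estimated or justified from the data.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
x = rng.uniform(1, 10, size=n)
sigma = 0.3 * x                        # assumed: noise grows with x
y = 2.0 + 0.5 * x + rng.normal(scale=sigma)

X = np.column_stack([np.ones(n), x])   # design matrix with an intercept column
w = 1.0 / sigma**2                     # inverse-variance weights

# WLS solves the weighted normal equations (X^T W X) beta = X^T W y.
XtW = X.T * w                          # same as X.T @ diag(w), without building diag(w)
beta_wls = np.linalg.solve(XtW @ X, XtW @ y)

# Ordinary least squares for comparison (all weights equal).
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

print("WLS [intercept, slope]:", np.round(beta_wls, 3))
print("OLS [intercept, slope]:", np.round(beta_ols, 3))
```

Libraries such as statsmodels also provide a WLS estimator that takes a `weights` argument; the manual version here is only meant to show what the weights do.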

During the upcoming class, I plan to ask the professor and TA for advice on how to implement WLS in the project.

September 18, 2023

Today, I learned about regression analysis involving two variables.

The multiple regression formula is represented as

Y = β0 + β1X1 + β2X2…

In this context, Y represents the percentage of diabetics, X1 stands for the percentage of inactivity, and X2 indicates the percentage of obesity.
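A minimal sketch of fitting this formula with scikit-learn's `LinearRegression` is shown below. The data are synthetic, and the true coefficients used to generate them are made up for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(7)
n = 250
inactive = rng.uniform(10, 30, size=n)   # stand-in for X1, % inactivity
obese = rng.uniform(15, 35, size=n)      # stand-in for X2, % obesity
diabetic = 1.0 + 0.15 * inactive + 0.10 * obese + rng.normal(scale=0.5, size=n)

X = np.column_stack([inactive, obese])
model = LinearRegression().fit(X, diabetic)

b0 = model.intercept_    # estimate of β0
b1, b2 = model.coef_     # estimates of β1 and β2
print(f"Y = {b0:.3f} + {b1:.3f}*X1 + {b2:.3f}*X2")
```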

How did I use Linear regression in the project?

I started off by importing data from an Excel file. Using the pandas library in Visual Studio, I was able to efficiently process and analyze the data.

I then used linear regression to see the potential relationship between %diabetes and %obesity, as well as between %diabetes and %inactivity. I also generated smooth histograms for both %diabetes and %obesity. These were accompanied by some key statistical metrics, such as mean, median, skewness, and kurtosis, offering a deeper understanding of the data’s distribution and features.
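The summary statistics mentioned above can be computed directly with pandas. This is a sketch on synthetic columns; with the real data the DataFrame would come from the Excel file (for example via `pd.read_excel`), and the column names here are assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
df = pd.DataFrame({
    "% DIABETIC": rng.normal(8.0, 1.0, size=500),   # synthetic values
    "% OBESE": rng.normal(18.0, 1.5, size=500),     # synthetic values
})

summary = pd.DataFrame({
    "mean": df.mean(),
    "median": df.median(),
    "skewness": df.skew(),
    "kurtosis": df.kurt(),   # excess kurtosis: about 0 for a normal distribution
})
print(summary.round(3))
```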

September 15, 2023

The simple linear regression plot failed to reveal many details about the connection between obesity and inactivity.

As I understand it, the aim is to figure out whether diabetes is affected by obesity and inactivity. In the linear regression, I used two independent variables, obesity and inactivity. While plotting the graph, I found that there are many outliers and could not find much of a relationship.

I am still debugging the errors and still have questions about what exactly we need to find from the dataset.

Importance of P Value in Stats

P-values are a widely used concept in statistics and scientific study; however, for those who are completely new to statistical methods, they can be very puzzling.

So, what are p values?

The p-value, short for “probability value,” is the probability of seeing a result at least as extreme as the one observed if only chance were at work (that is, if the null hypothesis were true).

So, why is P-value important?

It helps us figure out whether the patterns in the data are likely to reflect a real effect rather than chance. A low p-value, often less than 0.05, means that results like ours would be unlikely if only chance were at work. This gives us some confidence that we are on to something important.

In conclusion, what I understood is that if the p-value is smaller than the chosen significance level, the null hypothesis will be rejected.

Since I will be using linear regression in the project, the p-value plays an important role. I may need to calculate p-values from the dataset for parameters such as inactivity and obesity. I am not sure how I am going to implement this, so I will ask the TA or the professor at Friday’s doubt session.
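One possible way to get a p-value for a regression slope is `scipy.stats.linregress`, which tests the null hypothesis that the slope is zero. The sketch below uses synthetic data; the variable names and the 0.05 threshold are assumptions for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n = 100
inactive = rng.uniform(10, 30, size=n)                       # synthetic predictor
diabetic = 4.0 + 0.2 * inactive + rng.normal(scale=1.0, size=n)
unrelated = rng.normal(size=n)                               # has no effect on diabetic

related = stats.linregress(inactive, diabetic)
noise = stats.linregress(unrelated, diabetic)

alpha = 0.05  # conventional significance level
for label, res in [("inactive", related), ("unrelated", noise)]:
    decision = "reject" if res.pvalue < alpha else "fail to reject"
    print(f"{label}: slope={res.slope:.3f}, p={res.pvalue:.4g} -> {decision} H0")
```

`linregress` handles a single predictor only; for several predictors at once, a multiple-regression package that reports per-coefficient p-values would be needed.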

September 11, 2023

While the data is structured and lacks duplicate values, there are numerous other factors to consider when examining variables related to conditions like diabetes and obesity. For instance, if the weather in a particular county or state is excessively cold compared to others, people may increase their food consumption for survival. Furthermore, if a county is located within a state where fast food consumption is prevalent, the likelihood of individuals experiencing inactivity, obesity, and diabetes is significantly higher.

My strategy for tackling this project involves several steps. First, I plan to divide the counties based on their respective states and categorize them into either northern or southern regions. This division will provide valuable insights into why certain states or counties exhibit higher rates of diabetes, obesity, or inactivity.

My initial focus will be on inactivity and obesity, as I believe that inactivity often leads to obesity, which in turn can increase the risk of diabetes. To facilitate this analysis, I have organized the counties using the Federal Information Processing Standard (FIPS) codes, making it easier to group and study the data. Additionally, I have used Python to compute various statistical parameters such as mean, median, and standard deviation to gain a deeper understanding of the data’s characteristics.
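A sketch of the grouping step with pandas follows. The first two digits of a five-digit county FIPS code identify the state; the percentages, the column names, and the north/south assignment below are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "FIPS": [1001, 1003, 25017, 25025, 48201],          # county FIPS codes
    "% INACTIVE": [20.0, 22.0, 14.0, 16.0, 19.0],       # made-up values
    "% OBESE": [30.0, 28.0, 21.0, 23.0, 27.0],          # made-up values
})

# Codes read from a spreadsheet as integers lose leading zeros; restore them.
df["FIPS"] = df["FIPS"].astype(str).str.zfill(5)
df["state_fips"] = df["FIPS"].str[:2]

# Hypothetical region assignment, keyed by state FIPS code.
region_of_state = {"01": "south", "25": "north", "48": "south"}
df["region"] = df["state_fips"].map(region_of_state)

cols = ["% INACTIVE", "% OBESE"]
by_state = df.groupby("state_fips")[cols].agg(["mean", "median", "std"])
by_region = df.groupby("region")[cols].mean()

print(by_state)
print(by_region)
```

A state with a single county in the sample gets a missing value for the standard deviation, since it needs at least two observations.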

In conclusion, I plan to collaborate with Dr. Dylan George to determine the specific findings he requires from our data analysis. However, I am uncertain about how to check for heteroscedasticity using Python, and I intend to seek clarification from my instructors during class discussions.
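One common check for heteroscedasticity is a Breusch-Pagan style test: fit the regression, regress the squared residuals on the same predictors, and compare n times the R-squared of that auxiliary regression to a chi-squared distribution. The sketch below implements that version by hand on synthetic data whose noise grows with x; the data and the choice of this particular test are my assumptions, not the method prescribed for the project.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)
n = 300
x = rng.uniform(1, 10, size=n)
y = 2.0 + 0.5 * x + rng.normal(scale=0.3 * x)   # noise spread grows with x

X = np.column_stack([np.ones(n), x])             # intercept + one predictor


def ols_residuals(X, y):
    """Residuals from an ordinary least squares fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ beta


# Step 1: residuals of the main regression, squared.
u2 = ols_residuals(X, y) ** 2

# Step 2: auxiliary regression of the squared residuals on the predictors.
aux_resid = ols_residuals(X, u2)
centered = u2 - u2.mean()
r2_aux = 1 - (aux_resid @ aux_resid) / (centered @ centered)

# Step 3: LM statistic, chi-squared with one degree of freedom per predictor.
lm_stat = n * r2_aux
p_value = stats.chi2.sf(lm_stat, df=X.shape[1] - 1)

print(f"LM statistic = {lm_stat:.2f}, p-value = {p_value:.3g}")
```

A small p-value suggests that the residual variance depends on the predictors, which is the situation where weighted least squares is usually considered.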