I want to outline a few things I learned (or rediscovered since my last stats course) in the process of building a linear regression model from scratch. This was step 1 in my process of building up to a logistic mixed model from scratch, as I outlined previously.
You can find my code on GitHub. In order to meet my goal, this regression model does a few simple things:
In the notebook you can also see a comparison of this output against statsmodels. The results are identical aside from rounding.
Multiple linear regression is quite simple to compute
I always knew multiple linear regression had a closed-form solution, but it feels like quite something else to see that you can calculate just about any linear regression in fewer than 10 lines of numpy. Linear regression has substantial power (I still use it often), and it is really cool to see that it boils down to just a few high-level linear algebra operations. This feeling is a total matter of perspective, though; if I were writing the numpy matrix-inverse function I might not feel quite the same. The resource I found most helpful in writing the original function was some Penn State graduate course online material.
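To make the "closed form in a few lines" claim concrete, here is a minimal sketch of the normal-equations fit in numpy. This is not my exact notebook code, and the function name `fit_ols` is just illustrative:

```python
import numpy as np

def fit_ols(X, y):
    """Fit ordinary least squares via the closed-form normal equations.

    X: (n, p) design matrix WITHOUT an intercept column; one is added here.
    Returns the coefficient vector, intercept first.
    """
    # Prepend a column of ones so the first coefficient is the intercept.
    X = np.column_stack([np.ones(len(X)), X])
    # Closed-form solution: beta = (X'X)^{-1} X'y
    return np.linalg.inv(X.T @ X) @ X.T @ y

# Tiny example with an exactly linear relationship, y = 1 + 2x
x = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])
print(fit_ols(x, y))  # ~ [1. 2.]
```

In practice `np.linalg.lstsq` is the more numerically stable way to solve the same problem, but the explicit inverse mirrors the textbook formula.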
Bootstrapping is more complex than meets the eye
If you’re not familiar, bootstrapping is a method that samples from the original data with replacement and refits the specified model many times. This resampling and refitting turns out to be an accurate way to estimate the variability of coefficients and test statistics. I often feel like bootstrapping is a tool that folks grab for quickly ('Oh, just bootstrap it'). However, going through the process of writing a bootstrap function made me realize there are a multitude of ways to bootstrap, and that it can really matter which one you choose. I mainly used some MIT lecture notes (pdf warning) to learn the specifics while coding. I chose to code the empirical bootstrap as a matter of simplicity, but I could have also chosen parametric, semi-parametric, or percentile-based bootstrap functions.
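As a sketch of what the empirical (case-resampling) bootstrap looks like in code, the loop below resamples rows with replacement and refits OLS each time. The function name and defaults are my own, not taken from my repo:

```python
import numpy as np

def empirical_bootstrap_se(X, y, n_boot=1000, seed=0):
    """Estimate coefficient standard errors with the empirical bootstrap:
    resample rows with replacement, refit OLS, and look at the spread."""
    rng = np.random.default_rng(seed)
    n = len(y)
    X1 = np.column_stack([np.ones(n), X])  # add an intercept column
    betas = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)   # sample row indices with replacement
        # lstsq is more numerically stable than an explicit matrix inverse
        beta, *_ = np.linalg.lstsq(X1[idx], y[idx], rcond=None)
        betas.append(beta)
    # The standard deviation across refits is the bootstrap standard error.
    return np.std(betas, axis=0, ddof=1)
```

A parametric bootstrap would instead simulate new responses from the fitted model; only the resampling step changes, which is part of why the choice is easy to gloss over.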
Overall p-values for models are just a comparison against a dummy baseline
The final thing I want to note is how we calculate an overall p-value for a multiple linear regression. The p-value for the overall model roughly answers the question 'Is there a relationship between the predictors and the target variable that is stronger than we could expect by pure chance alone?' The dummy-baseline part of this comparison is what struck me while I was writing the code. My graduate coursework leaned much more toward the algorithmic modeling culture (read: machine learning) than the traditional statistics culture of making assumptions about how data is generated. So seeing that an overall p-value is just a comparison of the full multiple regression model, with intercept and predictors, against a 'restricted model' of only the intercept was interesting. It reminded me of how machine learning papers (at least when I was still reading them) often set up a naive baseline to compare a new model against. Of course this makes sense; the machine learning literature likely drew on these traditional statistics parallels when creating its methods.
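The full-vs-restricted comparison can be written out as the standard overall F-test. A minimal sketch (my own function name; I lean on scipy only for the F distribution's tail probability) might look like:

```python
import numpy as np
from scipy import stats

def overall_f_test(X, y):
    """Overall F-test: full model (intercept + predictors) against the
    intercept-only 'restricted' model, whose predictions are just y.mean()."""
    n, p = X.shape
    X1 = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    ss_res = resid @ resid                # full-model residual sum of squares
    ss_tot = np.sum((y - y.mean()) ** 2)  # restricted (intercept-only) sum of squares
    # F compares the variance the predictors explain to the leftover noise.
    f_stat = ((ss_tot - ss_res) / p) / (ss_res / (n - p - 1))
    p_value = stats.f.sf(f_stat, p, n - p - 1)  # right-tail probability
    return f_stat, p_value
```

The `ss_tot` line is the dummy baseline in code: the restricted model's residuals are just deviations from the mean of y.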
Writing this up took longer than expected, likely because I took too much of my own advice about doing something unproductive. However, I'm still committed to finishing this series. Part two, logistic regression, coming soon!