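The gradient used in the update step is the derivative of the MSE, so its coefficient should be (2/m) instead of (1/m). You’ll see the same eventual result either way, because the missing factor of 2 just acts like halving the learning rate, but with the correct equation the same learning rate converges in fewer iterations (as long as it is still small enough to be stable).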
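To make that concrete, here is a minimal sketch in plain Python. It is not the article's numpy code. The names `mse`, `gradient_step`, `fit`, `factor` and `cost_history`, the toy data, and the learning rate are all made up for illustration, and I use `n` for the sample count. It runs the same descent with the (2/n) and the (1/n) coefficient so the two cost histories can be compared.

```python
def mse(theta, xs, ys):
    """Mean squared error of the line y = slope * x + bias."""
    slope, bias = theta
    return sum((slope * x + bias - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def gradient_step(theta, xs, ys, lr, factor):
    """One gradient descent update.

    factor=2.0 gives the exact MSE gradient (2/n); factor=1.0 gives the
    (1/n) variant, which points the same way but is half as long.
    """
    slope, bias = theta
    n = len(xs)
    errors = [slope * x + bias - y for x, y in zip(xs, ys)]
    grad_slope = factor / n * sum(e * x for e, x in zip(errors, xs))
    grad_bias = factor / n * sum(errors)
    return (slope - lr * grad_slope, bias - lr * grad_bias)


def fit(xs, ys, lr, factor, steps):
    """Run gradient descent from theta = (0, 0); return theta and the cost per step."""
    theta = (0.0, 0.0)
    cost_history = []
    for _ in range(steps):
        theta = gradient_step(theta, xs, ys, lr, factor)
        cost_history.append(mse(theta, xs, ys))
    return theta, cost_history


# Toy data on the line y = 2x + 1 (invented for this example).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

theta_exact, cost_exact = fit(xs, ys, lr=0.05, factor=2.0, steps=50)
theta_half, cost_half = fit(xs, ys, lr=0.05, factor=1.0, steps=50)
# The (1/n) variant with a doubled learning rate takes the same steps as (2/n).
theta_half_2lr, _ = fit(xs, ys, lr=0.10, factor=1.0, steps=50)

print("MSE after 50 steps, (2/n) gradient:", cost_exact[-1])
print("MSE after 50 steps, (1/n) gradient:", cost_half[-1])
print("(2/n) theta:", theta_exact, "| (1/n) theta with doubled lr:", theta_half_2lr)
```

On this toy problem the (2/n) run should end with the lower cost after the same number of steps, and the (1/n) run with a doubled learning rate should land on the same theta as the (2/n) run. That is why both versions reach the same answer eventually.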
Also, using the variable ‘m’ for the number of samples is confusing when it normally denotes the slope.
Overall, good article and great use of numpy linear algebra dot products for efficiency. I also like storing theta (slope, bias) and cost history for each iteration/epoch to see progress.