Unsupervised Learning – K-means Clustering

Clustering is a kind of unsupervised learning in which we try to identify similarities and patterns in unlabeled data. Data points that exhibit similarities are grouped together into a cluster. There is no single correct way to separate the data points, so clustering is subjective. Ideally, good clusters exhibit high intra-cluster similarity and low inter-cluster similarity: the maximum distance between points within a cluster is small, while the distance between clusters is large. There are several ways to measure similarity, which will be covered in a separate blog post.
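To make the "small within, large between" idea concrete, here is a minimal sketch that measures both quantities with Euclidean distance. The helper names `diameter` and `centroid` and the two toy clusters are made up for illustration. Euclidean distance is only one of the possible similarity measures.

```python
from itertools import combinations
from math import dist  # Euclidean distance, available from Python 3.8


def diameter(cluster):
    """Largest pairwise distance within one cluster (0.0 for a single point)."""
    return max((dist(p, q) for p, q in combinations(cluster, 2)), default=0.0)


def centroid(cluster):
    """Coordinate-wise mean of the points in a cluster."""
    n = len(cluster)
    return tuple(sum(coords) / n for coords in zip(*cluster))


# Two made-up 2-D clusters, purely for illustration.
cluster_a = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0)]
cluster_b = [(10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]

# Distance between the clusters, measured centroid to centroid.
separation = dist(centroid(cluster_a), centroid(cluster_b))

print("diameter of A:", diameter(cluster_a))
print("diameter of B:", diameter(cluster_b))
print("separation:   ", separation)
```

For these toy points the separation is roughly ten times either diameter, which is the pattern a good clustering should show.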


Random Restart Hill Climbing Algorithm

Advantages of Random Restart Hill Climbing:
Because you pick a new random starting point each time a local optimum is reached, it reduces the risk of settling on a local optimum that is not the global optimum. It does not remove that risk entirely, but each additional restart makes it smaller. It is also not much more expensive than a simple hill climb, since you only multiply the cost by a constant factor: the number of random restarts.

Disadvantages of Random Restart Hill Climbing:
If your random restart points are all very close together, you will keep reaching the same local optimum. Take care that each new restart point is far from the previous ones. This allows a more systematic approach to random restarting.
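The trade-offs above can be seen in a small sketch. It assumes a one-dimensional integer search space and a made-up objective `two_peaks` with a local peak at 5 and a higher peak at 40. The function names, the step size, and the number of restarts are all illustrative choices, not a reference implementation.

```python
import random


def two_peaks(x):
    """Toy objective: local peak at x=5 (value 10), global peak at x=40 (value 25)."""
    return max(10 - abs(x - 5), 25 - abs(x - 40))


def hill_climb(f, x, step=1):
    """Greedy ascent: move to the better neighbour until neither one improves."""
    while True:
        best = max((x - step, x + step), key=f)
        if f(best) <= f(x):
            return x  # no neighbour is better, so x is a local optimum
        x = best


def random_restart_hill_climb(f, low, high, restarts=20, seed=0):
    """Run hill_climb from several random starts and keep the best result."""
    rng = random.Random(seed)
    best = None
    for _ in range(restarts):
        candidate = hill_climb(f, rng.randint(low, high))
        if best is None or f(candidate) > f(best):
            best = candidate
    return best


print("single climb from 0:", hill_climb(two_peaks, 0))
print("with random restarts:", random_restart_hill_climb(two_peaks, 0, 60))
```

A single climb starting at 0 stalls on the lower peak at 5. The total cost with restarts is the cost of one climb multiplied by `restarts`. If every start were drawn from a narrow range such as 0 to 10, each climb would end on the same lower peak, which is the disadvantage described above.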


Finding Coordinates within California’s Boundaries — Ray Casting Method

Here is the Python code I used to identify the points that are within California’s geographical boundaries. The state is approximated by a 4-sided polygon whose corners are estimated locations placed slightly beyond the actual boundaries, so that every point in California is captured. Because the polygon is only an approximation, it also captures some points just outside the state.

The output file will contain only the data for points that fall within the identified polygon.


# Helper code found on http://geospatialpython.com/2011/01/point-in-polygon.html
# Determine if a point is inside a given polygon or not.
# Polygon is a list of (x, y) pairs. This function returns True or False.
# The algorithm is called the "Ray Casting Method".

def point_in_poly(x, y, poly):
    n = len(poly)
    inside = False

    # Walk every edge (p1 -> p2) and toggle "inside" at each crossing.
    p1x, p1y = poly[0]
    for i in range(n + 1):
        p2x, p2y = poly[i % n]
        if y > min(p1y, p2y):
            if y <= max(p1y, p2y):
                if x <= max(p1x, p2x):
                    if p1y != p2y:
                        # x coordinate where the edge crosses the line through y
                        xints = (y - p1y) * (p2x - p1x) / (p2y - p1y) + p1x
                    if p1x == p2x or x <= xints:
                        inside = not inside
        p1x, p1y = p2x, p2y

    return inside

# Corners are (latitude, longitude) pairs, so x is latitude and y is longitude.
polygon = [(41.979656, -125.678101), (41.995733, -119.761963),
           (32.888237, -112.763672), (31.521191, -119.443359)]

# Read each record and append the ones inside the polygon to the output file.
with open('quake_data.txt') as f, open('point_in_poly.txt', 'a') as out:
    for line in f:
        col = line.rstrip('\r\n').split(',')
        lat = float(col[1])
        llong = float(col[2])

        # Call the function once with the point and the polygon.
        inside = point_in_poly(lat, llong, polygon)
        print(inside)
        if inside:
            out.write(','.join(col) + '\n')
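
As a quick sanity check that does not depend on quake_data.txt, the ray casting idea can be tried on a couple of known cities. The sketch below is a compact even-odd crossing test written for illustration, not the helper quoted above. The name `crosses_odd` and the approximate city coordinates are assumptions.

```python
def crosses_odd(x, y, poly):
    """Even-odd rule: count how many polygon edges a ray from (x, y) crosses."""
    inside = False
    n = len(poly)
    for i in range(n):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % n]
        # Only edges that straddle the line through y can be crossed.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


# Same four (latitude, longitude) corners as the polygon in the post.
polygon = [(41.979656, -125.678101), (41.995733, -119.761963),
           (32.888237, -112.763672), (31.521191, -119.443359)]

# Approximate coordinates, used only for illustration.
sacramento = (38.58, -121.49)
new_york = (40.71, -74.01)

print("Sacramento inside:", crosses_odd(*sacramento, polygon))
print("New York inside:  ", crosses_odd(*new_york, polygon))
```

Points that lie exactly on an edge can go either way with this kind of test, and the two variants may disagree on them.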