You want to maintain an approx. convex hull of a stream of points . Assume is large and unknown, points aren’t necessarily drawn from the same distribution, and the approx. hull can be requested at any time during processing. Here are a few ways that we might do this…
— Keep the min/max values for each dimension. These values can be used to construct a bounding box of the true hull. This can yield a pretty bad approximation that 1) overestimates the volume and 2) isn’t bounded. Ooof.
— Keep the points with min/max values in each direction. When the hull is required we can compute it from this subset of points in time. This algorithm produces a volume which is no larger than the true hull’s.
— This approach is similar to . Select random unit vectors on the dimension unit sphere. Instead of keeping points, keep the points which maximize or minimize for any one of the unit vectors. For large this should begin to approach the volume of the true hull. This is simulated in the code (with a “nice” example) below.
— I don’t know how large must be to be to ensure our estimate is -close to the true hull’s volume. I tried to estimate that ensures the surface of is within of a vector w.h.p and came up with a dodgy argument that involved the ratio of the surface area of and the volume of . After approximating to , I’m thinking that it decreases exponentially w. a parameter . This is quite large, high dimensions scare me, and something still felt incomplete so I let this line of thinking go…
import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial import ConvexHull
def stream_points(d: int, n: int):
return np.random.normal(0, 1, size=(n, d))
def rand_basis_vect(d: int, m: int):
A = 2 * (0.5 - np.random.random(size=(m, d)))
return A / np.linalg.norm(A, ord=2, axis=1).reshape(m, 1)
n_dim, n_rand_vect, n_points = 2, 8, 64 * 1024
L, H = np.zeros(n_rand_vect), np.zeros(n_rand_vect)
A = rand_basis_vect(n_dim, n_rand_vect)
estbuf, truebuf = [], []
for p in stream_points(n_dim, n_points):
S = np.dot(A, p)
nL, nH = np.minimum(L, S), np.maximum(H, S)
if abs(np.linalg.norm(nL - L)) > 0:
L = nL
estbuf.append(p)
if abs(np.linalg.norm(nH - H)) > 0:
H = nH
estbuf.append(p)
truebuf.append(p)
truebuf = np.array(truebuf).reshape(len(truebuf), n_dim)
estbuf = np.array(estbuf).reshape(len(estbuf), n_dim)
hull, esthull = ConvexHull(truebuf), ConvexHull(estbuf)
plt.scatter(truebuf[:, 0], truebuf[:, 1], marker=".", alpha=0.5)
for simplex in hull.simplices:
plt.plot(truebuf[simplex, 0], truebuf[simplex, 1], "r-")
for simplex in esthull.simplices:
plt.plot(estbuf[simplex, 0], estbuf[simplex, 1], "k-", alpha=0.5)
plt.scatter(estbuf[esthull.vertices, 0], estbuf[esthull.vertices, 1], color="k", alpha=0.8)
plt.scatter(truebuf[hull.vertices, 0], truebuf[hull.vertices, 1], color="r")
plt.show()— Wrote this second section (everything below this box) as a “quick follow-up” a few weeks later. It got kinda long and almost required its own post. Instead I decided to just smash the two together…
Recall from above. If we’ve implemented this algorithm, how can we determine if a point included in the approximate convex hull is likely to be in the true convex hull? Here’s a quick hand-waving argument using the positive reals in 2-dimensions:
Assume we know the distribution function of the distances, . For illustration’s sake we’ll assume the pdf is sub-Gaussian. We’ll use this property in step .
There exists for each which satisfies . Thus, the probability that a point is a member of the true hull is at least . Note that this is quite an overestimate of the replacement probability as the probability is replaced by a “better” point () depends on the direction and magnitude of the replacing point.
We could extend the criteria from to all points and say that the probability we have the correct convex hull is at least . This is nice and straightforward, but we can get closer…
— A point replaces if it satisfies . For example, a point is replaced if there’s a point with in the first quadrant. We can integrate over to find the required vector length (and its probability) conditional on the difference in angles:
The equations above assume a sub-Gaussian distribution of distances and radial symmetry. You’d be very lucky to have symmetry and known . If you’ve got that - you may as well just use the result in . This will help you (future me) if you ever need to work on this seriously: Some Desmos Equations. Good Luck!