Probably a pretty straightforward question, but I'm just blanking on it. Any help is appreciated.

Hopefully, I can set this up in a non-confusing way:

I have a cubic box in a three-dimensional Euclidean space. (let's say 0 <= x,y,z <= 100)

I have a large number of points of interest inside this box. (call it A)

I have a subset of A that is super-special (call it B). The points of B form a cluster that is amorphous, but the points of B are in such proximity that you could think of B as an unbroken, blob-like object.

I want to find all points x such that:

1.) x is in A

2.) x is not in B

3.) x is within a distance R of any point in B.

To put it a different way, given an amorphous macromolecule (composed of atoms) surrounded by particles (different atoms), I want to find all of the particles not in the macromolecule that are within a distance R of the macromolecule.

I can brute-force this by calculating distances between every pair of particles in the groups of interest, but I know there has to be a better way, and I'm just drawing a blank on it. Any suggestions?

## Neighbors of a cluster: What is this called?

**Moderators:** phlip, Moderators General, Prelates

- WibblyWobbly
- Can't Get No
**Posts:**506**Joined:**Fri Apr 05, 2013 1:03 pm UTC

### Re: Neighbors of a cluster: What is this called?

My initial thought is to narrow down the search space by finding the bounding box of B, expand it by R, and brute force from there. Starting from the convex hull would narrow down the search space even further, but would require more work to calculate (assuming you want to do this for any arbitrary set B).

she/they

gmalivuk wrote:Yes. And if wishes were horses, wishing wells would fill up very quickly with drowned horses.King Author wrote:If space (rather, distance) is an illusion, it'd be possible for one meta-me to experience both body's sensory inputs.

### Re: Neighbors of a cluster: What is this called?

one way to get it down to about O(nk) where n is the number of points and k is the number of close-to-B points is by doing a dijkstra like algorithm. first treat the points as a graph with n vertices and n^2 edges.

then run dijkstra, but start with the set B instead of the singleton set containing the root, and you stop the search once you reach a node of distance R. then all the nodes so far found, minus the set B, is the nodes you want.

since every time you add a node to your set, you can also add up to n edges, then you may potentially end up with O(nk) edges in your priority queue. since the brute force solution is O(n^2) when k is a small fraction of n, this could potentially save a lot of time

edit: a further optimization is whenever you add a node to your frontier, instead of adding all O(n) edges to the priority queue, only add edges of length R-d or less, where d is the distance the current node is at. then, there can be at most k of these edges cutting the runtime down to O(k^2)

then run dijkstra, but start with the set B instead of the singleton set containing the root, and you stop the search once you reach a node of distance R. then all the nodes so far found, minus the set B, is the nodes you want.

since every time you add a node to your set, you can also add up to n edges, then you may potentially end up with O(nk) edges in your priority queue. since the brute force solution is O(n^2) when k is a small fraction of n, this could potentially save a lot of time

edit: a further optimization is whenever you add a node to your frontier, instead of adding all O(n) edges to the priority queue, only add edges of length R-d or less, where d is the distance the current node is at. then, there can be at most k of these edges cutting the runtime down to O(k^2)

### Re: Neighbors of a cluster: What is this called?

Depending on exactly what you are doing, and how the data are organized, and how many data points there are, the following approaches may be beneficial:

• Split the universe into spatial bins (small boxes, like a 3D checkerboard) by coordinates, and keep track of which atoms are in which bins. This will let you see which bins have any members of B in them, and thus quickly narrow down which bins you need to look at.

• Store the data in a k-d tree.

• Approximate the shape of B with either a bounding box, or a convex hull, or a stick-frame skeleton (similar to a spanning tree).

You might also want to read up on collision detection.

• Split the universe into spatial bins (small boxes, like a 3D checkerboard) by coordinates, and keep track of which atoms are in which bins. This will let you see which bins have any members of B in them, and thus quickly narrow down which bins you need to look at.

• Store the data in a k-d tree.

• Approximate the shape of B with either a bounding box, or a convex hull, or a stick-frame skeleton (similar to a spanning tree).

You might also want to read up on collision detection.

wee free kings

### Re: Neighbors of a cluster: What is this called?

What you want is a spatial datastructure, which allows you to do calculations like finding nearest neighbors in O(log(n)) time. The k-d trees that Qaanol mentioned are a common solution.

### Re: Neighbors of a cluster: What is this called?

Random thoughts:

- Bin your points into a 3D grid where each cell is a cube with sides of length R/sqrt(3). For every particle in B, mark the cell it's inside, and make a list of all the particles from B in each cell. For each particle from A, calculate what cell it's in. If a particle from A is within a marked cell, then it's less than R away from a particle in B, so that goes into your list. Mark those particles as "done", but keep the rest of the particles in A as "unknown". This step cuts out a chunk of A without even having to do any distance calculations.

- Set up another 3D grid with lengths of side R this time. Mark every grid cell with a particle from B inside it *as well as every cell adjacent to that cell*, and list all of the B particles in each cell. For each particle A that's not in a marked cell, or is already known to be within R of a particle from B from the previous step, you can ignore it. For each particle A that's in a marked cell, check the distances only to all particles B within its cell and all adjacent cells (27 cells) to see whether A is close enough to be in your set. Stop the loop as soon as you find one particle B that's close enough, because then you know that A is in your set.

You can improve that by using trees and so on, but the basic idea is to chop up your domain and cut out the particles you know aren't going to count. If you're not comfortable with trees, then try binning things in a uniform grid just to get an idea of what's going on. It's not the perfect optimal solution, but it'll still be a huge speed-up over an N^2 calculation over all particles.

As another minor thing, you can improve speed a little by looking at distance^2 rather than distance. The square root operation is one of the slowest basic operations a computer can do. Things often go quite a bit faster if you check whether distance^2<R^2 rather than doing the square root to calculate distance.

- Bin your points into a 3D grid where each cell is a cube with sides of length R/sqrt(3). For every particle in B, mark the cell it's inside, and make a list of all the particles from B in each cell. For each particle from A, calculate what cell it's in. If a particle from A is within a marked cell, then it's less than R away from a particle in B, so that goes into your list. Mark those particles as "done", but keep the rest of the particles in A as "unknown". This step cuts out a chunk of A without even having to do any distance calculations.

- Set up another 3D grid with lengths of side R this time. Mark every grid cell with a particle from B inside it *as well as every cell adjacent to that cell*, and list all of the B particles in each cell. For each particle A that's not in a marked cell, or is already known to be within R of a particle from B from the previous step, you can ignore it. For each particle A that's in a marked cell, check the distances only to all particles B within its cell and all adjacent cells (27 cells) to see whether A is close enough to be in your set. Stop the loop as soon as you find one particle B that's close enough, because then you know that A is in your set.

You can improve that by using trees and so on, but the basic idea is to chop up your domain and cut out the particles you know aren't going to count. If you're not comfortable with trees, then try binning things in a uniform grid just to get an idea of what's going on. It's not the perfect optimal solution, but it'll still be a huge speed-up over an N^2 calculation over all particles.

As another minor thing, you can improve speed a little by looking at distance^2 rather than distance. The square root operation is one of the slowest basic operations a computer can do. Things often go quite a bit faster if you check whether distance^2<R^2 rather than doing the square root to calculate distance.

### Re: Neighbors of a cluster: What is this called?

Alternately, you could look at Friend of Friend algorithms. This is what we use in astrophysics to e.g. find galaxies in a cosmological simulation of particles.

### Re: Neighbors of a cluster: What is this called?

SpitValve wrote:As another minor thing, you can improve speed a little by looking at distance^2 rather than distance. The square root operation is one of the slowest basic operations a computer can do. Things often go quite a bit faster if you check whether distance^2<R^2 rather than doing the square root to calculate distance.

If you just need accuracy within say 5–10%, you can do even better by using α·max + β·min, which lets you replace squaring with multiplication by a constant, and smart choices for the constants let you replace those multiplies by a few shifts and adds. Somewhere I have a spreadsheet in which I calculated near-optimal values for the constants in several dimensions.

wee free kings

### Re: Neighbors of a cluster: What is this called?

If you just need accuracy within say 5–10%, you can do even better by using α·max + β·min, which lets you replace squaring with multiplication by a constant, and smart choices for the constants let you replace those multiplies by a few shifts and adds. Somewhere I have a spreadsheet in which I calculated near-optimal values for the constants in several dimensions.

I think that's only faster if you're using the bit shifts - otherwise you're doing the same number of multiplications, plus a couple of extra comparison steps.

### Who is online

Users browsing this forum: No registered users and 17 guests