Ask HN: How do you measure confidence for similarity metrics?

10 years ago

Hey folks, I'm interested in measuring the Sørensen–Dice coefficient between two objects.

Some of the objects are much larger than others, yet they can still overlap. I'd therefore like to attach a confidence level to each similarity score and understand what that implies for sample sizes.

Here's some sample code I was playing with:

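  const clj_fuzzy = require('clj-fuzzy');

  // Two overlapping collections of very different sizes.
  // Note: toString() turns each array into a comma-separated string,
  // so the metric below is given two strings, not two arrays.
  const a = [1,2,3,4,5,6].toString();
  const b = [2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26].toString();

  console.log(a);
  console.log(b);

  console.log('dice', clj_fuzzy.metrics.dice(a, b));

  /*
  Output:

  1,2,3,4,5,6
  2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26
  dice 0.43478260869565216
  */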

That seems like a reasonable approximation of what I want, and works well for the different lengths and overlaps entailed by the use case I have in mind.
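One caveat on the number above: because both arrays are stringified first, the score appears to be computed over the text rather than over the elements. The printed value equals 20/46, which is what a Dice coefficient over distinct character bigrams would give here (10 bigrams in the first string, 36 in the second, all 10 shared). For comparison, here is a minimal sketch of the element-level coefficient, 2|A ∩ B| / (|A| + |B|). The names `setDice`, `small`, and `large` are made up for illustration and are not part of clj-fuzzy.

```javascript
// Element-level Sørensen–Dice: 2 * |A ∩ B| / (|A| + |B|), treating inputs as sets.
// setDice is a hypothetical helper, not a clj-fuzzy function.
function setDice(xs, ys) {
  const a = new Set(xs);
  const b = new Set(ys);
  // Convention chosen for this sketch: two empty sets count as identical.
  if (a.size + b.size === 0) return 1;
  let shared = 0;
  for (const x of a) {
    if (b.has(x)) shared += 1;
  }
  return (2 * shared) / (a.size + b.size);
}

const small = [1, 2, 3, 4, 5, 6];
const large = Array.from({ length: 25 }, (_, i) => i + 2); // 2..26

// 5 shared elements (2..6), sizes 6 and 25: 10 / 31, roughly 0.3226
console.log('set dice', setDice(small, large));
```

The two scores differ (about 0.43 versus about 0.32) because character bigrams such as "1," or ",2" also match across unrelated numbers like 11 or 20. Which one is appropriate depends on whether the objects are really text or really collections of items.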

However, I'd like to understand it better so that I can explain it to others.

It seems like confidence ~= similarity, but I don't have the chops yet to know why. Any pointers? Thanks!
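One possible way to separate the two ideas is to treat the similarity score as a point estimate and put an interval around it. Below is a rough sketch of a percentile bootstrap for the element-level Dice coefficient. It assumes the observed elements can be treated as an independent random sample from some larger population, which may not hold for a given use case. All names here (`mulberry32` as a small seeded generator, `diceFromLabels`, `bootstrapDiceInterval`, and the option names) are illustrative and not taken from any library.

```javascript
// Small seeded PRNG (mulberry32) so the sketch gives repeatable results.
function mulberry32(seed) {
  let a = seed >>> 0;
  return function () {
    a = (a + 0x6d2b79f5) | 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Each element of the union is labeled 'both', 'aOnly', or 'bOnly'.
// Dice = 2*both / (|A| + |B|), and |A| + |B| = 2*both + (elements in only one set).
function diceFromLabels(labels) {
  let both = 0;
  for (const label of labels) {
    if (label === 'both') both += 1;
  }
  const denom = 2 * both + (labels.length - both);
  return denom === 0 ? 1 : (2 * both) / denom;
}

// Percentile bootstrap: resample the labeled union with replacement,
// recompute Dice each time, and read off the lower and upper quantiles.
function bootstrapDiceInterval(xs, ys, { resamples = 2000, level = 0.95, seed = 42 } = {}) {
  const a = new Set(xs);
  const b = new Set(ys);
  const labels = [];
  for (const x of a) labels.push(b.has(x) ? 'both' : 'aOnly');
  for (const y of b) if (!a.has(y)) labels.push('bOnly');

  const rand = mulberry32(seed);
  const stats = [];
  for (let r = 0; r < resamples; r++) {
    const sample = [];
    for (let i = 0; i < labels.length; i++) {
      sample.push(labels[Math.floor(rand() * labels.length)]);
    }
    stats.push(diceFromLabels(sample));
  }
  stats.sort((p, q) => p - q);

  const alpha = (1 - level) / 2;
  return {
    point: diceFromLabels(labels),
    lo: stats[Math.floor(alpha * (resamples - 1))],
    hi: stats[Math.ceil((1 - alpha) * (resamples - 1))],
  };
}

const setA = [1, 2, 3, 4, 5, 6];
const setB = Array.from({ length: 25 }, (_, i) => i + 2); // 2..26
const result = bootstrapDiceInterval(setA, setB);
console.log(result); // { point, lo, hi }
```

Under these assumptions, the width of the interval (not the score itself) is what reflects confidence: with only a handful of elements the interval tends to be wide, and it should narrow as more elements are observed, which is where sample size comes in.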