Re: [libvirt] [PATCH] Replace hashing algorithm with murmurhash

26 Jan 2012


      On Wed, Jan 18, 2012 at 11:34:16AM -0700, Eric Blake wrote:
...
On 01/18/2012 09:23 AM, Daniel P. Berrange wrote:
...
@@ -1636,9 +1637,10 @@ cleanup:
 }
-static unsigned long virCgroupPidCode(const void *name)
+static int32_t virCgroupPidCode(const void *name, int32_t seed)
 {
-    return (unsigned long)name;
+    unsigned long pid = (unsigned long)name;
+    return virHashCodeGen(&pid, sizeof(pid), seed);
I'm not sure if it matters, but you could shorten these two lines to
just one:
return virHashCodeGen(&name, sizeof(unsigned long), seed);
I actually preferred the slightly more verbose style, to make
it clearer that we're actually passing in the PID as an int,
cast to a pointer, instead of an int pointer.
...
...
@@ -101,11 +95,10 @@ static void virHashStrFree(void *name)
     VIR_FREE(name);
 }
-
 static unsigned long
 virHashComputeKey(virHashTablePtr table, const void *name)
 {
-    unsigned long value = table->keyCode(name);
+    uint32_t value = table->keyCode(name, table->seed);
     return (value % table->size);
If, on a 64-bit machine, table->size ever grows larger than uint32_t,
then we've lost some hashing abilities compared to what the old unsigned
long hash gave us.  Conversely, if table-size ever grows that large,
we're burning an awful lot of memory in the first place, and a hash
collision at that point is the least of our worries :)  I think this is
okay, although you may want to consider changing virHashComputeKey to
return uint32_t.
I changed the signature here to actually be  'size_t', likewise
for the table->size variable itself.
...
...
+++ b/src/util/hash.h
@@ -55,12 +55,15 @@ typedef int (*virHashSearcher) (const void *payload, const void *name,
 /**
  * virHashKeyCode:
  * @name: the hash key
+ * @seed: random seed
  *
- * Compute the hash code corresponding to the key @name
+ * Compute the hash code corresponding to the key @name, using
+ * @seed to perturb the hashing algorithm
  *
  * Returns the hash code
  */
-typedef unsigned long (*virHashKeyCode)(const void *name);
+typedef int32_t (*virHashKeyCode)(const void *name,
+                                  int32_t seed);
Umm, you used uint32_t in virHashComputeKey, but int32_t here.  While I
don't _think_ signed values will bite us due to sign-extension in 64-bit
operations introducing a bias, I'd rather make this signature:
typedef uint32_t (*virHashKeyCode)(const void *name, uint32_t seed);
and adjust the callers to comply with unsigned everywhere.
Yeah, that was a mistake - it should have been uint32.
...
That is, I propose that we ditch virRandom, and instead implement a new
function:
/* Return an evenly distributed random number between [0,2^nbits), where
nbits must be in the range (0,64]. */
uint64_t virRandomBits(int nbits)
{
    int bits_per_iter = count_one_bits(RAND_MAX);
    uint64_t ret = 0;
    int32_t bits;
    /* This algorithm requires that RAND_MAX == 2^n-1 for some n;
       gnulib's random_r meets this property. */
    verify(((RAND_MAX + 1U) & RAND_MAX) == 0);
virMutexLock(&randomLock);
while (nbits > bits_per_iter) {
        random_r(&randomData, &bits);
        ret = (ret << bits_per_iter) | (bits & RAND_MAX);
        nbits -= bits_per_iter;
    }
random_r(&randomData, &bits);
    ret = (ret << nbits) | (bits & ((1 << nbits) - 1));
virMutexUnlock(&randomLock);
    return ret;
}
where existing calls to virRandom(256) will now be virRandomBits(8),
virRandom(1024*1024*1024) will now be virRandomBits(30), and your code
would use uint32_t seed = virRandom(32).
Yep, I've done that.
...
...
+#include "virhashcode.h"
+
+static uint32_t rotl(uint32_t x, int8_t r)
+{
+    return (x << r) | (x >> (32 - r));
+}
Gnulib provides the bitrotate module, which provides "bitrotate.h" and
an implementation of this function that is guaranteed to be as efficient
as possible when compiled by gcc (possibly by using compiler-builtins to
access raw assembly on architectures that have builtin rotate
instructions).  I'd suggest using that rather than rolling your own.
Good point.
...
...
+
+/* slower than original but is endian neutral and handles platforms that
+ * do only aligned reads */
+__attribute__((always_inline))
+static uint32_t getblock(const uint8_t *p, int i)
+{
+    uint32_t r;
+    size_t size = sizeof(uint32_t);
+
+    memcpy(&r, &p[i * size], size);
+
+    return le32toh(r);
<endian.h>, and thus le32toh(), is not yet standardized (although POSIX
will be adding it in the future), nor is it currently provided by
gnulib.  We'd have to get that fixed first.
The le32toh call was only here because the code I copied wanted to be
endian neutral. I don't think libvirt really cares if its hash codes
are endian neutral, so I trivially just removed the le32toh call and
avoid the problem of endian.h


Daniel
-- 
|: http://berrange.com      -o-    http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org              -o-             http://virt-manager.org :|
|: http://autobuild.org       -o-         http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org       -o-       http://live.gnome.org/gtk-vnc :|