On Tue, Aug 15, 2017 at 10:02:38 -0400, John Ferlan wrote:
On 08/15/2017 08:01 AM, Peter Krempa wrote:
> On Mon, Aug 14, 2017 at 20:46:10 -0400, John Ferlan wrote:
>> On 08/03/2017 04:50 AM, Peter Krempa wrote:
>
> [ trimmed off-topic part ]
>
>> NB: I didn't dig into the qemumonitorjsontest algorithm, but...
>>
>> While going through the common object changes, I ended up looking at and
>> thinking about the hash table algorithms, initially it was just the
>> "grow" algorithm as I think it is a bit "aggressive"
(perhaps wasteful)
>> since as soon as 1 bucket chain exceeds 8 elements, the table is resized
>> by 8 unless the new size is 8 * 2048 - at which time no matter how full
>> a bucket is we don't resize the table
>
> Resizing the table is very expensive, so it makes sense to scale it
> aggresively. At some point though it would take too much memory.
>
> If some place in the code expects massive hash table, it can set the
> hash table to > 2048 manually.
>
Right - avoiding excessive or unnecessary resizing is something I was
considering. One thing that was at the back of my mind is "somewhat
recent" discussions about 10s of thousands of disk/volumes. IIRC it's
been discussed on qemu list as it relates to libguestfs.
If we assume that our hash function has an even distribution of hashes
(which was not refuted yet) then for 10k entries you should have 5ish
entries per bucket. Even with 100k you are at 50 per bucket. That is
totally managable.
Because volume names tend to be less random than a UUID, I was
looking
at how the existing algorithms operate when you have say "disk00001"
through "disk99999" w/r/t bucket layout and length of chains when we
reach the maximum bucket count.
Still even with that, resizing a table just because one hash bucket
chain goes above 8 just seemed unnecessary. Still a work in progress
though - I may come up with nothing. It's the classic power struggle of
time vs. space.
(This was covered by dan's reply)
>> The next thing I considered was using a prime number as the
table bucket
>> size and it seems by just doing that will cause qemumonitorjsontest to
>
> What would be the benefit of such change? I don't really see how prime
> numbers would improve anything performance-wise.
>
Nothing performance wise directly that comes to mind. I see that while
composing Michal and Daniel have responded.
FWIW: My hacking has altered virhashtest.c to generate 1000's of UUIDs
and "elemN" names in order to check if/when one or the other has a
bucket full problem with the existing algorithms. It's been a couple
weeks since I was looking at the data - so results are not fresh in mind
especially since I was on PTO last week.
>> fail. Easy enough to reproduce by changing the virHashCreate call in
>> qemuBlockNodeNameGetBackingChain to use 53 instead of 50 (even 49 causes
>> failure). It doesn't matter if the test source also adjusts the initial
>> hash size.
>>
>> Of course this makes sense given that virHashCodeGen is the backend of
>> the virHashComputeKey algorithm that uses the return value %
>> table->size. So if "size" changes, then obviously order changes.
>
> As you've said that is expected. I don't quite follow what is your
> point here.
>
I was pointing out that by changing the size you get different ordered
results, nothing more, nothing less. It just so happens that 50 returns
this set of results, but 53 is different.
>> While I get the code bloat conundrum, but if the order doesn't matter
>> then I wonder what other variables could change that would affect this
>> test and whether we really want to be constantly altering the mocks
>> "just" to get the right answer for this test.
>
> The order depends on the result of the hasing function, the number of
> buckets and the order you've added the entries. All of those is
> expected when dealing with a hash table.
>
> After this last patch, you are guaranteed to have the hash function
> return deterministic results (even on big endian-boxes).
>
Change the size to 53, rebuild, and run the test. It'll take a few
minutes, but I think you'd see a failure.
I fail to see the point of such excercise. It is expected that it will
happen. If we wanted to make the testsuite slightly more robust in this
regard, we could also mandate a fixed hash size for the mocked hash, but
I don't quite feel the need to do so.
> The hash size is not really altered very often since you
can't really
> see a measurable performance benefit in most cases. This is also
> strenghtened by the fact, that the hash function is randomized by a
> random number, so in real usage you won't be able to deterministically
> cause hash collisions.
True... at 50 with 4 elements you won't see a resize. You could possibly
have a table size of 1 in this instance and be fine, but that only works
well for the test! It's more problematic for real world because a table
size of 1 grows to 8, 64, etc. Some would say the minimum table size
should be at least 8 if not the next prime, e.g. 11.
This is not a problem for real use cases. Real use cases don't care
about the ordering and thus it does not matter whether it's resized or
not. Hash table is a wrong data structure for lists.
In the tests it's a bit different. The goal of the testsuite should be
to allow to write simple tests. Otherwise it will not be easy to
persuade people to add tests at all. This means that sometimes it's
necessary to cut corners.
In case of the testsuite that was failing, the result was correct in any
ordering. The checker I wrote was dumb though ... really dumb. Just a
strcmp. I don't want to waste time to write tests to do a full N to N
checker. That does not make any sense.
If the code uses a hash table for convenience/performance reasons and
it's convenient to test the contents of the table in the test suite,
it's just best to format it to a string and compare it against a file.
Especially since you get "VIR_TEST_REGEGERATE_OUTPUT" if you happen to
modify the test. That would be rather compicated with a specific hash
table checker and make test-writing more painful.
In the long run, UUID's are random, but "#blockN"
values are far less
random - it's the human randomness factor. No one is going to create 100
randomly named domain names or volume names, so factoring in the less
than randomly named elements is what I've been pondering.
This also was covered by dan's statement about the hashing function.
> In the test code, the cases are added in a deterministic order.
Since
> the hashing function in the tests is deterministic as well, the only
> thing that would influence the ordering is the bucket count. We don't
> modify them too often.
>
> I don't see what would make us alter mocks "just" to get the right
> answer here. The reason for this patch was, that the hash function is
> using bit operations and then treating the result as integers, which
> obviously does not work deterministically between big and little endian
> arches. Other than that, the test is now fully deterministic.
>
> Peter
>
Understood - until someone changes the size of the table to be random or
less deterministic, then there's no problem. But if that happens,
there's a problem.
Umm, why would anybody do that? It would not make any sense to allocate
it at random. That's plain crazy. For anything other than that, it would
be deterministic, and thus while the test output might change at the
point when it would be changed, it would not lose determinism in the
suite. The only randomization which makes sense is in the hashing
function itself and we do have that.
Also as it was pointed out. The bucket count choice does not really add
much in terms of distribution of elements to the buckets. Thanks to the
auto-resize algorithm it's even somewhat resistant to undersized hashes.
Adding any kind of magic on top of this just does not seem to be worth
at all.