24 Million entries and I need to what?

Sasha Pachev sasha at asksasha.com
Fri Dec 27 13:00:45 MST 2013


The absolute fastest way to solve this problem I can think of short of
resorting to hand-crafted assembly is to write a C program using a
fast hash library, e.g uthash, then split the key space among threads
and go for it. There are a few technical gotchas. Off the top of my
head without actually having my hands on it (which means my ideas need
to be taken with a grain of salt) I am thinking of the following that
needs to be taken care of:

- maintain a local hash for each thread and synchronize/dump/merge
into the global in batches. Otherwise multi-processor will be of
little help because you have to have to take a mutex when writing into
the global hash, and this will cause a serious performance nightmare
if you do it once per entry.
- if your key space is read from files, mmap() the files
- the devil is often in the details - make sure you are
parsing/generating strings efficiently, do things in place when
possible, etc
- to test the correctness make the key space of two threads partially
overlap on purpose and see if you are catching the duplicates


-- 
Sasha Pachev

Fast Running Blog.
http://fastrunningblog.com
Run. Blog. Improve. Repeat.


More information about the PLUG mailing list