Embedded full-text search

Levi Pearson levipearson at gmail.com
Wed Jun 6 14:24:38 MDT 2012


On Wed, Jun 6, 2012 at 12:57 PM, AJ ONeal <coolaj86 at gmail.com> wrote:
>>
>> > http://ft3.sourceforge.net/
>> > http://www.sqlite.org/fts3.html#section_1
>> >
>> > I don't think this is built-in to the popular sqlite implementations yet.
>>
>> That's what the cross-compiling toolchain is for. :)
>>
>
> UGH! NEVER! For the love of all that is holy, don't get me started on the
> evils of cross-compiling.
> Native compile all the way. Waste 45 minutes on compiling and save yourself
> weeks of troubleshooting and toolchain problems.
>
> Let's say you were using ruby or python or nodejs or lua. It's not a matter
> of compiling the c library, it's a matter of writing the bindings.
>

Heh, you're spoiled.  Most of the programming I've done has been for
embedded systems that don't have the resources for a self-hosted
compiler.  I had one platform early on that had toolchain issues, but
mostly it hasn't been a problem for me.  Getting your cross-compile
toolchain working well definitely pays off, especially when you're
doing all your development in C or C++ and you've got a lot of code to
develop.  If you have weeks of toolchain-related trouble, you need a
new toolchain or a new platform.

>> I was a bit baffled that when approaching a full-text search problem,
>> the first thing you reached for was a SQL database.  I guess if you've
>> got the data in a database anyway as part of a media storage system,
>> that makes sense.  This seems like a reasonable solution.
>
> The first thing I reached for was in-memory storage, then rolling my own
> file-system-based db with caching, but getting up past 100,000 files
> started to be unreasonable to manage.

I'm curious about the approach you took with the filesystem-based db
with caching.  In what way was the database filesystem-based?

If I had to code something, my first thought would be to use something
like this:

The main string storage would be a memory-mapped file that stored the
strings in null-terminated form one after another.  There would be a
separate word index (again stored in a memory-mapped file so it
wouldn't have to be rebuilt every time), probably in the form of a
ternary search trie, that would use the words as keys and, as the
value associated with each key, some sort of collection of byte
offsets for the strings the word occurs in.
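
Concretely, the structures I have in mind would look roughly like this
in C (just a sketch; the names and layout are made up for
illustration):

    #include <stddef.h>
    #include <stdint.h>

    /* Growable list of byte offsets into the string file, one per
     * string that contains a given word. */
    struct offset_list {
        uint32_t *offsets;
        size_t    count;
        size_t    capacity;
    };

    /* One node of the ternary search trie keyed on words. */
    struct tst_node {
        char                ch;      /* character stored at this node    */
        struct tst_node    *lo;      /* subtree for characters < ch      */
        struct tst_node    *eq;      /* subtree for the next character   */
        struct tst_node    *hi;      /* subtree for characters > ch      */
        struct offset_list *strings; /* non-NULL only if a word ends here */
    };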

To search, you would follow the trie until you found a match or a
failure, then add each offset to the base pointer of the mapped
database to retrieve the matching strings.  Tries also have the nice
feature of supporting prefix and near-neighbor searches while being
pretty fast.  Ternary search tries are slower but more space-efficient
than normal tries, and they're comparable in speed to a typical hash
table while taking less space.
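
The lookup side might go something like this, assuming the structures
from the sketch above (error handling omitted):

    /* Walk the trie for a word; returns its offset list, or NULL if
     * the word isn't indexed. */
    static struct offset_list *tst_find(struct tst_node *n, const char *word)
    {
        while (n) {
            if (*word < n->ch)
                n = n->lo;
            else if (*word > n->ch)
                n = n->hi;
            else {
                word++;
                if (*word == '\0')
                    return n->strings;  /* NULL if only a prefix ends here */
                n = n->eq;
            }
        }
        return NULL;
    }

    /* Resolve one hit: base is the start of the mmap'd string file,
     * so adding an offset lands on a null-terminated string. */
    static const char *string_at(const char *base, uint32_t offset)
    {
        return base + offset;
    }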

To add a string, you just append it to the file and keep track of its
starting offset from the beginning of the file.  You then insert each
unique word from the string into the index structure.  If a lookup
shows the word is already indexed, the new string's offset gets
appended to that word's collection; otherwise the word is added as a
new key with a single-element collection.
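
And the insert path might look roughly like this (again just a sketch,
with made-up names and minimal error handling):

    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    /* Find-or-create the offset list for a word in the trie. */
    static struct offset_list *tst_insert(struct tst_node **np, const char *word)
    {
        struct tst_node *n = *np;
        if (!n) {
            n = calloc(1, sizeof *n);
            n->ch = *word;
            *np = n;
        }
        if (*word < n->ch)
            return tst_insert(&n->lo, word);
        if (*word > n->ch)
            return tst_insert(&n->hi, word);
        if (word[1] == '\0') {
            if (!n->strings)
                n->strings = calloc(1, sizeof *n->strings);
            return n->strings;
        }
        return tst_insert(&n->eq, word + 1);
    }

    /* Append a string to the data file and index each of its words
     * under the string's starting offset. */
    static uint32_t add_string(int fd, struct tst_node **root, const char *s)
    {
        uint32_t start = (uint32_t)lseek(fd, 0, SEEK_END);
        write(fd, s, strlen(s) + 1);        /* keep the terminating NUL */

        char buf[1024];
        strncpy(buf, s, sizeof buf - 1);
        buf[sizeof buf - 1] = '\0';
        for (char *w = strtok(buf, " \t\n"); w; w = strtok(NULL, " \t\n")) {
            struct offset_list *l = tst_insert(root, w);
            if (l->count && l->offsets[l->count - 1] == start)
                continue;                   /* word already seen in this string */
            if (l->count == l->capacity) {
                l->capacity = l->capacity ? l->capacity * 2 : 8;
                l->offsets = realloc(l->offsets, l->capacity * sizeof *l->offsets);
            }
            l->offsets[l->count++] = start;
        }
        return start;
    }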

Using memory-mapped files has the advantage of letting the OS manage
paging and caching in a fairly efficient way, and it's pretty darn
simple to use.  Used in this naive way, though, it could lose changes
in the mapped pages to a crash.
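
For what it's worth, the mapping itself is only a few lines, and an
explicit msync() at known-good points would at least narrow that crash
window (sketch only; most error handling omitted):

    #include <fcntl.h>
    #include <stddef.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    /* Map an existing file read/write; the OS handles paging/caching. */
    static char *map_file(const char *path, size_t *size)
    {
        int fd = open(path, O_RDWR);
        if (fd < 0)
            return NULL;
        struct stat st;
        if (fstat(fd, &st) < 0) {
            close(fd);
            return NULL;
        }
        char *base = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE,
                          MAP_SHARED, fd, 0);
        close(fd);                  /* the mapping survives the close */
        if (base == MAP_FAILED)
            return NULL;
        *size = (size_t)st.st_size;
        return base;
    }

    /* Flush dirty pages at a consistent point (e.g. after an insert)
     * so a crash can't silently drop changes still sitting in memory. */
    static int checkpoint(char *base, size_t size)
    {
        return msync(base, size, MS_SYNC);
    }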

        --Levi

