What the heck?
S. Dale Morrey
sdalemorrey at gmail.com
Fri Dec 13 12:20:01 MST 2013
Thanks Levi. That's some very sage advice.
To be clear about where I'm coming from: I already wrote an app in node.js that
did exactly what I needed it to do, i.e. stuff the entire tx chain of
bitcoin into an RDS so I can query it later using SQL style queries (part
of a service I'm working on similar to blockchain.info, but meant for
merchants to quickly look up balances).
This is sort of my own "hello world" for node. :)
The problem I am trying to solve here is that the application is horribly
slow. Therefore I decided to refactor it (actually rewrite from scratch
might be a better term) into individual execution units, string them
together with message queues and have each unit run on its own Amazon spot
instance. This gives me the ability to bring more dedicated execution
units online to handle the various work flow stages depending on queue size
and work remaining.
With the original version of the application, I was looking at over a month
and possibly much longer to get everything into the database. My goal is
to take that down to a few hours at most. This is possible because there
are many, many possible points of parallelization.
Unfortunately doing it this way would also swamp my datasources, so I
needed to stagger the calls a little bit so I don't get cut off/banned.
At the moment I have 2 datasource providers (what the clientnum are
actually connecting to), but I wanted to be able to bring more online to
handle the workload if needed.
So there are a few steps involved.
Execution Unit #1
The first step is to get the total number of blocks. Next, the blockhash at
each height n (n++) is queried and the hashes are placed into an array. When
7kb of blockhashes are in the array, a new SQS message is sent. The array is
cleared and gathering begins anew.
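
A minimal sketch of the batching in this step, assuming the 7kb limit is
measured on the JSON-serialized array. createBatcher and send are
hypothetical names; send stands in for whatever actually posts the SQS
message. This version flushes just before the limit would be exceeded.

```javascript
// Hypothetical batcher: collects block hashes and hands a batch to `send`
// before the JSON-serialized array would grow past `maxBytes`.
function createBatcher(send, maxBytes = 7 * 1024) {
  let batch = [];
  return {
    add(hash) {
      // Re-serializing on every add is wasteful but keeps the sketch short.
      const sizeWithHash = Buffer.byteLength(JSON.stringify([...batch, hash]));
      if (batch.length > 0 && sizeWithHash > maxBytes) {
        send(batch);
        batch = [];
      }
      batch.push(hash);
    },
    // Call once at the end so a partially filled batch is not lost.
    flush() {
      if (batch.length > 0) {
        send(batch);
        batch = [];
      }
    },
  };
}
```

The caller would add() each blockhash as it arrives and flush() after the
last one.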
Execution Unit #2
Reads the message queue, fetching arrays of block hashes. The array is
treated like a stack and a hash is popped off the top. We then query the
datasource for the actual block referenced by the hash and obtain the tx
hashes contained in it. The rest proceeds as in XU#1, but messages are
placed into a different queue.
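
Roughly, the pop-and-look-up part of this unit could look like the sketch
below. fetchBlock and emitTxHash are placeholders supplied by the caller
(not a real library API), and I'm assuming the block comes back as an
object with a tx array of transaction hashes.

```javascript
// Hypothetical XU#2 step: treat the array as a stack, look up each block,
// and pass its tx hashes along. `fetchBlock(hash, cb)` and
// `emitTxHash(txHash)` are assumed to be provided by the caller.
function processBlockHashes(hashes, fetchBlock, emitTxHash, done) {
  const stack = hashes.slice(); // copy so the caller's array is untouched
  function next() {
    const hash = stack.pop();
    if (hash === undefined) return done(null); // stack is empty
    fetchBlock(hash, (err, block) => {
      if (err) return done(err);
      block.tx.forEach((txHash) => emitTxHash(txHash)); // assumed block shape
      next(); // only one lookup is in flight at a time
    });
  }
  next();
}
```

In the real thing emitTxHash would feed the same kind of batching as XU#1,
just pointed at the second queue.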
Execution Unit #3
Reads the txqueue, fetching the txhashes. Query the datasource for full
tx's for each hash. Do some data transformations on the tx's and stuff
them into an RDS such as mysql.
It sounds simple enough, but I have a limited number of datasource
providers. One is mine and I can control it, the other is a public
resource and I have to be very careful not to overwhelm them. More than 1
query per second from a single IP address, or 2 queries per second across
multiple IPs on the same account, will trigger a disconnect.
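
One way to stay under that limit, sketched with a callback that
reschedules itself instead of a loop that sets all the timers up front.
drainThrottled and runJob are made-up names, and the interval is a
parameter because the right value depends on the provider. The schedule
argument only exists so the timer can be swapped out.

```javascript
// Hypothetical throttle: starts at most one job per `intervalMs` against a
// single datasource. `runJob(job)` is whatever issues the actual query.
function drainThrottled(jobs, runJob, intervalMs, schedule = setTimeout) {
  let next = 0;
  function step() {
    if (next >= jobs.length) return;
    runJob(jobs[next++]);
    // Each step schedules the following one, so no timers pile up and
    // the stack does not grow with the number of jobs.
    if (next < jobs.length) schedule(step, intervalMs);
  }
  step();
}
```

For the public provider something a bit over 1000ms per IP would be the
starting point, with a separate drainThrottled call per provider.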
There are currently ~275000 blocks. If the average block contains 10 tx,
that's about 2.75M queries on top of the blockhash queries. Furthermore, a
tx consists of 2 parts, a txin and a txout. Txouts are simple end points,
but txins contain a reference to the hash of the source tx and an offset of
the txout it originates from. So with the exception of coinbase
transactions I'm looking at a minimum of 2 and an average of probably 5 or
10 reverse lookups per tx. The blockchain itself is only about 10GB right
now. I can see the final datastore being >100GB without even really trying.
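
To put rough numbers on that, here is a back-of-the-envelope calculation.
It assumes one query per blockhash, one per block and one per tx, a single
provider held at 1 query per second, and it leaves the txin reverse
lookups out entirely. estimateLoad is just a name for this sketch.

```javascript
// Rough query-count estimate. All inputs are the approximations from the
// text above; reverse lookups for txins are deliberately left out.
function estimateLoad(blocks, txPerBlock, queriesPerSecond) {
  const blockHashQueries = blocks;        // XU#1: one blockhash per height
  const blockQueries = blocks;            // XU#2: one block per hash
  const txQueries = blocks * txPerBlock;  // XU#3: one full tx per tx hash
  const queries = blockHashQueries + blockQueries + txQueries;
  const days = queries / queriesPerSecond / 86400; // 86400 seconds per day
  return { queries, days };
}

const load = estimateLoad(275000, 10, 1);
console.log(load.queries + ' queries, about ' + Math.round(load.days) + ' days');
```

That works out to 3.3M queries and about 38 days through a single
rate-limited provider, which is in line with the month-plus I was seeing.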
This is making me think I might be better off getting my information
directly from the p2p network instead of using a single datasource or even
a handful. But I don't want to try and implement the low level details of
issues in current versions of node. So this is what I've got :(
In the back of my mind I'm seeing an easier and possibly faster way to
accomplish this using a scatter gather technique, but the idea itself is
not clearly formed yet.
On Thu, Dec 12, 2013 at 3:20 PM, Levi Pearson <levipearson at gmail.com> wrote:
> On Thu, Dec 12, 2013 at 2:16 PM, S. Dale Morrey <sdalemorrey at gmail.com> wrote:
> > Now I've got to figure out how to slow down the requests and gradually
> > feed them to the server. Batching will help somewhat, but the max I can
> > send in a batch is 100 and even then it's going to quickly overwhelm the
> > server if I send them out without a delay between sends.
> My previous email suggested setting the timeout to i * 500ms, which
> will space them 500ms apart. You can easily space them more the
> obvious way. To introduce a batching-style periodic delay of, say, 5
> seconds for every 100 requests, you would simply add another term to
> your delay calculation by doing an integer divide of i by 100 and
> multiplying by 5 seconds.
> > Another option is to get rid of the loop entirely and just have the
> > callback calling in a cycle, but I'm afraid that's going to smash the
> > stack :(
> One of the most frustrating aspects of callback-style programming, but
> perhaps fortuitous for you in this instance, is that each callback
> executes with a fresh stack. It's in no way connected to the stack
> frame that generated the closure; indeed, the callback might be a
> top-level function and not a nested closure at all. So if each
> invocation updates its loop variables and sets a timeout callback for
> 500ms in the future to invoke itself, you'll use no extra stack space
> at all. This will also eliminate the problem of the closures always
> referring to the loop index variable of a loop that already completed!
> PLUG: http://plug.org, #utah on irc.freenode.net
> Unsubscribe: http://plug.org/mailman/options/plug
> Don't fear the penguin.