What the heck?
ddavidegli at gmail.com
Tue Dec 17 02:56:20 MST 2013
On December 15, 2013 S. Dale Morrey wrote:
> Now let's hope my choice of RDS for this project (mysql cuz I'm an idiot and
> a cheap one at that) doesn't choke on the fact that I'm cramming 20+GB of
> data into a single table.
I kind of doubt it, but if it does choke, you could always try out
Postgres. I hear it handles things better at times, especially with
large data sets. I've never tried more than about 10% of that in a
single table in any DB, but I know larger tables exist, and I imagine
larger ones exist in the open source databases like MySQL or PGSQL. Good
luck either way! :)
On Sun, Dec 15, 2013 at 1:48 AM, S. Dale Morrey <sdalemorrey at gmail.com> wrote:
> So this seems to be working out really well now.
> I've got the entire thing operational and the finalized data looks about as
> I would expect it to.
> Total Execution time to this step 892.853 seconds
> Total blocks complete: 17099 of 274910
> 15 minutes for ~17,000 blocks.
> That's 68,000 blocks per hour!
> It's going to be a LOT less than several days to get this DB fed. More
> like 4 hours, and that's for the entire blockchain including transactions.
> Now let's hope my choice of RDS for this project (mysql cuz I'm an idiot and
> a cheap one at that) doesn't choke on the fact that I'm cramming 20+GB of
> data into a single table.
> Thanks for all the help!
> On Fri, Dec 13, 2013 at 11:20 AM, S. Dale Morrey <sdalemorrey at gmail.com> wrote:
> > Thanks Levi. That's some very sage advice.
> > To be clear about where I'm coming from: I already wrote an app in node.js
> that did exactly what I needed it to do, i.e. stuff the entire bitcoin tx
> chain into an RDS so I can query it later using SQL style queries (part
> of a service I'm working on similar to blockchain.info, but meant for
> merchants to quickly look up balances).
> > This is sort of my own "hello world" for node. :)
> > The problem I am trying to solve here is that the application is horribly
> slow. Therefore I decided to refactor it (actually rewrite from scratch
> might be a better term) into individual execution units, string them
> together with message queues and have each unit run on its own amazon spot
> instance. This gives me the ability to bring more dedicated execution
> units online to handle the various work flow stages depending on queue size
> and work remaining.
> > With the original version of the application, I was looking at over a
> month and possibly much longer to get everything into the database. My
> goal is to take that down to a few hours at most. This is possible because
> there are many, many possible points of parallelization.
> > Unfortunately doing it this way would also swamp my datasources, so I
> needed to stagger the calls a little bit so I don't get cut off/banned.
> > At the moment I have 2 datasource providers (what the clients are
> actually connecting to), but I wanted to be able to bring more online to
> handle the workload if needed.
> > So there are a few steps involved.
> > Execution Unit #1
> > The first step is to get the total number of blocks. Next, the blockhash
> at each height n is queried and the hashes are placed into an array. When
> 7kb of blockhashes have accumulated in the array, a new sqs message is
> sent. The array is cleared and gathering begins anew.
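The batching step above can be sketched roughly as follows (a minimal sketch; `makeBatcher` and the `flush` callback are illustrative stand-ins for the actual code and the real SQS sendMessage call):

```javascript
// Sketch of Execution Unit #1's batching step. `flush` stands in for the
// real SQS send; the 7 KB threshold matches the description above. Hashes
// accumulate in an array; once the serialized payload reaches the limit,
// the batch is flushed, the array is cleared, and gathering begins anew.
function makeBatcher(limitBytes, flush) {
  let batch = [];
  return {
    add(hash) {
      batch.push(hash);
      if (Buffer.byteLength(JSON.stringify(batch)) >= limitBytes) {
        flush(batch);
        batch = [];
      }
    },
    // send whatever is left once the block hashes run out
    done() {
      if (batch.length) {
        flush(batch);
        batch = [];
      }
    }
  };
}

// Demo with a fake send that just records each batch:
const sent = [];
const batcher = makeBatcher(7 * 1024, (b) => sent.push(b));
for (let n = 0; n < 1000; n++) {
  batcher.add(String(n).padStart(64, '0')); // 64-char hash placeholder
}
batcher.done();
```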
> > Execution Unit #2
> > Reads the message queue, fetching arrays of block hashes. The array is
> treated like a stack and a hash is popped off the top. We then query the
> datasource for the actual block referenced by the hash and obtain the tx
> hashes contained in it. The rest proceeds as in EU#1, but messages are
> placed into a different queue.
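In the callback style the thread discusses, Execution Unit #2 might look something like this (a sketch only; `fetchBlock` and `sendToTxQueue` are hypothetical stand-ins for the real datasource query and SQS send):

```javascript
// Sketch of Execution Unit #2. The incoming array is treated as a stack:
// pop a block hash, fetch the block, gather its tx hashes, then forward
// the collected hashes to the tx queue.
function processBlockHashMessage(hashes, fetchBlock, sendToTxQueue) {
  const txHashes = [];
  (function next() {
    if (hashes.length === 0) return sendToTxQueue(txHashes);
    const blockHash = hashes.pop();            // treat the array like a stack
    fetchBlock(blockHash, function (err, block) {
      if (err) throw err;                      // real code would retry/requeue
      txHashes.push(...block.tx);              // tx hashes contained in the block
      next();
    });
  })();
}

// Demo with a fake datasource that returns two tx hashes per block:
let queued = null;
const fakeFetch = (h, cb) => cb(null, { tx: [h + ':tx0', h + ':tx1'] });
processBlockHashMessage(['h1', 'h2', 'h3'], fakeFetch, (txs) => { queued = txs; });
```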
> > Execution Unit #3
> > Reads the tx queue, fetching the tx hashes. It queries the datasource for
> the full tx for each hash, does some data transformations on the txs, and
> stuffs them into an RDS such as mysql.
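The transformation step in EU#3 boils down to flattening a decoded tx into rows for a bulk insert. A sketch, with an illustrative column layout (this is not the poster's actual schema):

```javascript
// Flatten one decoded transaction into rows: one row per txin and one per
// txout. A nested array like this can be bulk-inserted with the node mysql
// driver's parameterized `INSERT ... VALUES ?` form.
function txToRows(blockHash, tx) {
  const rows = [];
  tx.vin.forEach((vin, i) =>
    rows.push([tx.txid, blockHash, 'in', i, vin.txid || null, vin.vout ?? null, null]));
  tx.vout.forEach((vout, i) =>
    rows.push([tx.txid, blockHash, 'out', i, null, null, vout.value]));
  return rows;
}

// Demo with a one-in, two-out transaction:
const rows = txToRows('blockhash0', {
  txid: 'txA',
  vin: [{ txid: 'txPrev', vout: 0 }],   // reference to the source txout
  vout: [{ value: 1.5 }, { value: 0.25 }]
});
```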
> > It sounds simple enough, but I have a limited number of datasource
> providers. One is mine and I can control it, the other is a public
> resource and I have to be very careful not to overwhelm them. More than 1
> query per second from a single IP address, or 2 queries per second across
> multiple IPs on the same account, will trigger a disconnect.
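One way to stay under those limits is to round-robin requests across the providers, each capped at one query per second. A back-of-envelope scheduler (a sketch; the function name is mine):

```javascript
// Request i goes to provider i % k, delayed by floor(i / k) intervals, so
// no single provider ever sees more than one query per interval. With 2
// providers and a 1000 ms interval this sustains 2 queries/sec total.
function schedule(i, providers, intervalMs) {
  return {
    provider: i % providers,
    delayMs: Math.floor(i / providers) * intervalMs
  };
}

// e.g. const s = schedule(i, 2, 1000);
//      setTimeout(() => queryProvider(s.provider, hash), s.delayMs);
```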
> > There are currently ~275000 blocks. If the average block contains 10 tx,
> that's over 2.75M queries on top of the blockhash queries. Furthermore, a
> tx consists of 2 parts, txins and txouts. Txouts are simple end points,
> but each txin contains a reference to the hash of the source tx and the
> offset of the txout it originates from. So with the exception of coinbase
> transactions, I'm looking at a minimum of 2 and an average of probably 5 or
> 10 reverse lookups per tx. The blockchain itself is only about 10GB right
> now; I can see the final datastore being >100GB without even really trying.
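Spelling out that arithmetic (all figures are the rough estimates from the post, not measurements) shows why the rate limit dominates:

```javascript
const blocks = 275000;
const txPerBlock = 10;                    // rough average from the post
const lookupsPerTx = 5;                   // low end of the 5-10 estimate
const txQueries = blocks * txPerBlock;    // 2,750,000 tx fetches
const reverseLookups = txQueries * lookupsPerTx;
const totalQueries = blocks + txQueries + reverseLookups; // 16,775,000
// At the combined limit of 2 queries/sec, that is totalQueries / 2 seconds,
// i.e. roughly 97 days through a single pair of datasources -- which is why
// bringing more providers online matters.
const days = totalQueries / 2 / 86400;
```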
> > This is making me think I might be better off getting my information
> directly from the p2p network instead of using a single datasource or even
> a handful. But I don't want to try to implement the low-level details of
> the protocol myself, given issues in current versions of node. So this is
> what I've got :(
> > In the back of my mind I'm seeing an easier and possibly faster way to
> accomplish this using a scatter gather technique, but the idea itself is
> not clearly formed yet.
> > On Thu, Dec 12, 2013 at 3:20 PM, Levi Pearson <levipearson at gmail.com>
> >> On Thu, Dec 12, 2013 at 2:16 PM, S. Dale Morrey <sdalemorrey at gmail.com>
> >> > Now I've got to figure out how to slow down the requests and gradually
> >> > feed them to the server. Batching will help somewhat, but the max I can
> >> > send in a batch is 100, and even then it's going to quickly overwhelm
> >> > the server to send them out without a delay between sends.
> >> My previous email suggested setting the timeout to i * 500ms, which
> >> will space them 500ms apart. You can easily space them more the
> >> obvious way. To introduce a batching-style periodic delay of, say, 5
> >> seconds for every 100 requests, you would simply add another term to
> >> your delay calculation by doing an integer divide of i by 100 and
> >> multiplying by 5 seconds.
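That delay calculation can be written out directly (a sketch of the formula described above; `sendRequest` is a placeholder):

```javascript
// 500 ms spacing between requests, plus an extra 5 s pause for every
// completed batch of 100 (the integer divide Levi describes).
function delayFor(i) {
  return i * 500 + Math.floor(i / 100) * 5000;
}

// Usage: setTimeout(() => sendRequest(i), delayFor(i));
```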
> >> >
> >> > Another option is to get rid of the loop entirely and just have the
> >> > callback calling itself in a cycle, but I'm afraid that's going to
> >> > smash the stack :(
> >> One of the most frustrating aspects of callback-style programming, but
> >> perhaps fortuitous for you in this instance, is that each callback
> >> executes with a fresh stack. It's in no way connected to the stack
> >> frame that generated the closure; indeed, the callback might be a
> >> top-level function and not a nested closure at all. So if each
> >> invocation updates its loop variables and sets a timeout callback for
> >> 500ms in the future to invoke itself, you'll use no extra stack space
> >> at all. This will also eliminate the problem of the closures always
> >> referring to the loop index variable of a loop that already completed!
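The self-rescheduling pattern Levi describes might look like this (a sketch; the scheduler is injected so the demo below can use a synchronous stand-in, but in production you would pass `(fn) => setTimeout(fn, 500)`):

```javascript
// Each invocation handles one item, updates its own state, and re-arms
// itself. With setTimeout, every call starts on a fresh stack, so depth
// never grows -- and there is no shared loop variable from a completed
// loop for the closures to capture incorrectly.
function processAll(items, handle, scheduleNext, done) {
  let i = 0;
  (function step() {
    if (i >= items.length) return done();
    handle(items[i]);
    i++;                 // the "loop variable" lives in this closure alone
    scheduleNext(step);
  })();
}

// Demo with a synchronous scheduler (fine for three items; use setTimeout
// for real work so each step gets a fresh stack):
const seen = [];
processAll([1, 2, 3], (x) => seen.push(x), (fn) => fn(), () => seen.push('done'));
```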
> >> --Levi
> >> /*
> >> PLUG: http://plug.org, #utah on irc.freenode.net
> >> Unsubscribe: http://plug.org/mailman/options/plug
> >> Don't fear the penguin.
> >> */