The API has recently been redesigned and restructured because of some flaws it used to have. Things tend to look simple when we demonstrate simple examples, but that is not the point. Retrieving one record or a few of them is easy. However, if the system is supposed to scale, we may want to store a billion of them (one day ;)). That makes returning something like a List or an array from the API pointless when it comes to retrieving records. One problem is the capacity of the Java heap (always limited) and element addressing (most often limited to int size); the other is how to pull records via REST efficiently (one by one, in chunks, all combined, etc.). The solution I have come up with seems to be … scalable at least.
From the API perspective we have a method that loads all records from a particular table (domain/table) and takes a callback object as a parameter. This way it can load any number of records sequentially and pass each one to the given callback:
api.get(new RecordCallback() {
    @Override
    public void recordLoaded(Record record) {
        System.out.println("Record: " + record + " retrieved");
    }
}, "myCoolApp1", "logs");
This code gets all records from the “logs” table under the “myCoolApp1” domain. If we assume this “get” may be further sliced by offset/num, we could use it, e.g., in a view that renders records as they are loaded (a long scrollable list or just a web page). It could also be used in reporting/processing code: “process a million records from table X”. Under the hood it is a single big streaming JSON response: internally iterators are used, and the REST resource logic iterates through the set of record keys and streams record by record (JSON expressions, line by line). This means the number of returned records is practically unlimited. A compressed JSON payload rendered from even a huge volume of data may still not be that much to send from the cloud to the end client. Retrieval like that is not the simplest, but it meets the “must scale” requirement, so essential in any cloud-computing software and sometimes so hard to accomplish. Of course, once this is implemented, it is easy to wrap it in other methods that slice the result set from i to j and return a collection.
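The client side of such a line-by-line stream, together with a slicing wrapper, could look roughly like the sketch below. It is only an illustration, not SubRecord's actual code: `LineCallback` is a simplified stand-in for `RecordCallback` that receives a raw JSON line as a `String`, and `StreamingSketch`, `stream` and `slice` are hypothetical names. It assumes the response carries one JSON expression per line.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified callback: receives one raw JSON line per record.
interface LineCallback {
    void recordLoaded(String jsonLine);
}

class StreamingSketch {

    // Reads the response line by line and hands each non-empty line to the
    // callback. Only one line is held in memory at a time.
    // Returns the number of records passed to the callback.
    static long stream(Reader response, LineCallback callback) {
        long count = 0;
        try (BufferedReader reader = new BufferedReader(response)) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.isEmpty()) {
                    continue; // skip blank separator lines
                }
                callback.recordLoaded(line);
                count++;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return count;
    }

    // Convenience wrapper: collects the records with index in [from, to).
    static List<String> slice(Reader response, long from, long to) {
        List<String> result = new ArrayList<>();
        long[] index = {0}; // mutable counter usable from the lambda
        stream(response, line -> {
            long i = index[0]++;
            if (i >= from && i < to) {
                result.add(line);
            }
        });
        return result;
    }

    public static void main(String[] args) {
        String body = "{\"id\":1}\n{\"id\":2}\n{\"id\":3}\n{\"id\":4}\n";
        stream(new StringReader(body),
               line -> System.out.println("Record: " + line + " retrieved"));
        System.out.println(slice(new StringReader(body), 1, 3));
    }
}
```

Note that this `slice` still reads the stream to the end even after index `to` has been passed; a real implementation would more likely send offset/num to the server so that only the requested records travel over the wire.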
The latest alpha snapshot is available in the Download section. Sample usage of that method is here.
The current snapshot build brings the following items:
- Jersey and MINA versions upgraded, no more snapshot dependencies
- JDBM persistence support added
- “count” functionality added
- a couple of minor fixes
JDBM seems to be a somewhat forgotten key/value persistence mechanism, but it does its job, being very light and simple, and, interestingly, it uses B-Tree indices. It is very likely that this functionality could be used to index data stored on data servers, on top of or next to Lucene. MINA and Jersey are being intensively developed, which is why they have matured in the meantime, letting us avoid snapshot dependencies.
Changing topic… I was curious how many lines of code SubRecord has, so I ran a line count; see the report:
przemek@master:~$ perl cloc-1.08.pl ~/workspace/subrecord-trunk/src/
115 text files.
115 unique files.
446 files ignored.
http://cloc.sourceforge.net v 1.08 T=1.0 s (113.0 files/s, 8654.0 lines/s)
-------------------------------------------------------------------------------
Language          files     blank   comment      code    scale   3rd gen. equiv
-------------------------------------------------------------------------------
Java                109      1015      2696      4787 x   1.36 =        6510.32
XML                   4        10        41       105 x   1.90 =         199.50
-------------------------------------------------------------------------------
SUM:                113      1025      2737      4892 x   1.37 =        6709.82
-------------------------------------------------------------------------------
Believe it or not, there are 4787 lines of Java code!
Not bad.
This weekend brings another snapshot build. I’ve been working mainly on refactoring (the exception hierarchy) and on optimizing critical components like network connectivity, indexing and routing. After a few fixes, performance had dropped to something like 80 TPS on one node. I did my best to work that out and, thankfully, it has improved. Tests now run smoothly, and everything is lighter and more robust.
I have run a benchmark with JMeter and the results finally look promising. On a single node, throughput was 659.2 TPS for put with Lucene indexing. 999 concurrent threads were storing 20 records. One thing is the stability under such load, which shows the system must be fairly decent; the other, of course, is the result itself.
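For readers who want to reproduce this kind of number without JMeter, a throughput figure can be computed roughly as in the sketch below: start a fixed number of threads, let each perform a fixed number of operations, and divide the completed operations by the elapsed wall-clock time. This is not how JMeter works internally, and `ThroughputSketch` and `measureTps` are hypothetical names; the `operation` passed in stands for whatever performs a single put.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

class ThroughputSketch {

    // Runs `threads` workers, each calling `operation` `perThread` times,
    // and returns completed operations per second of wall-clock time.
    static double measureTps(int threads, int perThread, Runnable operation) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        CountDownLatch start = new CountDownLatch(1);
        AtomicLong completed = new AtomicLong();

        for (int t = 0; t < threads; t++) {
            pool.execute(() -> {
                try {
                    start.await(); // all workers begin together
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                for (int i = 0; i < perThread; i++) {
                    operation.run();
                    completed.incrementAndGet();
                }
            });
        }

        long begin = System.nanoTime();
        start.countDown();
        pool.shutdown();
        try {
            pool.awaitTermination(10, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("benchmark interrupted", e);
        }
        double seconds = (System.nanoTime() - begin) / 1e9;
        return completed.get() / seconds;
    }

    public static void main(String[] args) {
        // Stand-in operation: a short sleep instead of a real HTTP put.
        double tps = measureTps(8, 50, () -> {
            try {
                Thread.sleep(1);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        System.out.printf("Throughput: %.1f TPS%n", tps);
    }
}
```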
I’ve just uploaded the 0.2.2 snapshot build. What’s been done since last weekend:
- caching aspect
- components/internal architecture redesigned
- serialization using JBoss-serialization library
- Java API
- replication logic implemented using a consistent hashing algorithm
- JGroups discovery added: from now on, data servers ping the master and are noticed within the cluster
- tremendous amount of refactoring
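To illustrate the replication item in the list above, here is a minimal sketch of how consistent hashing can pick the data servers that hold copies of a record. It is a generic textbook-style ring, not SubRecord's implementation: the class `ConsistentHashRing`, its methods, the CRC32 hash and the use of virtual nodes are all assumptions made for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;
import java.util.zip.CRC32;

class ConsistentHashRing {
    // Ring position -> node name, kept sorted by position.
    private final TreeMap<Long, String> ring = new TreeMap<>();
    private final int virtualNodes;

    ConsistentHashRing(int virtualNodes) {
        this.virtualNodes = virtualNodes;
    }

    // CRC32 is used only because it is in the JDK and throws no checked
    // exceptions; any well-distributed hash would do.
    private static long hash(String s) {
        CRC32 crc = new CRC32();
        crc.update(s.getBytes(StandardCharsets.UTF_8));
        return crc.getValue();
    }

    // Places several points per node on the ring to even out the load.
    void addNode(String node) {
        for (int i = 0; i < virtualNodes; i++) {
            ring.putIfAbsent(hash(node + "#" + i), node);
        }
    }

    // Removes only the points that still belong to this node.
    void removeNode(String node) {
        for (int i = 0; i < virtualNodes; i++) {
            ring.remove(hash(node + "#" + i), node);
        }
    }

    // Walks clockwise from the key's position and returns the first
    // `replicas` distinct nodes; these would hold the copies of the record.
    List<String> nodesFor(String key, int replicas) {
        List<String> result = new ArrayList<>();
        if (replicas <= 0 || ring.isEmpty()) {
            return result;
        }
        long h = hash(key);
        List<String> walk = new ArrayList<>(ring.tailMap(h, true).values());
        walk.addAll(ring.headMap(h, false).values()); // wrap around the ring
        for (String node : walk) {
            if (!result.contains(node)) {
                result.add(node);
                if (result.size() == replicas) {
                    break;
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        ConsistentHashRing ring = new ConsistentHashRing(16);
        ring.addNode("data-server-1");
        ring.addNode("data-server-2");
        ring.addNode("data-server-3");
        System.out.println(ring.nodesFor("myCoolApp1/logs/42", 2));
    }
}
```

The useful property is that adding or removing a data server only moves the records whose ring segment is affected; placements that do not involve that server stay where they are.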
Revision 400 has been reached and what we’ve got here is:
- downloadable bundle that can be run on Windows and Linux
- sample scripts for PHP and shell (cURL) that put records via the HTTP interface
- support for two persistence mechanisms so far:
  - BerkeleyDB JE
  - file storage
- performance achieved on my box (local master and local data server): 196 TPS on record put with underlying BerkeleyDB persistence, 500 concurrent threads, logging on, Lucene indexing on