Jot notes for #meshU on data storage
Here’s a raw dump from Kate that I’ll format when I get home.
Freshbooks talk
DabbleDB
- While doing dev for Viaweb, took heat from pseudotechs like VCs and industry analysts for not using a relational database.
- Used files
- also took heat for using cheap PCs running FreeBSD
Two companies that used nonconventional data stores: Yahoo and Viaweb
- Yahoo: because we must. one large data set.
- Viaweb: because we can. masses of tiny data sets.
- Facebook does run on MySQL
- but the way they cluster their database still amounts to using a cloud
- we’re not going to do the same thing
- Open for mere mortals as a cloud: AppEngine, SimpleDB, SSDS (from Microsoft) – presenter is not a fan of MS but still likes it
- Common for all of these: much more restricted feature set than typical relational database
- No joins, grouping, aggregation. Restricted sorting and restricted CPU. Dynamic schema.
- Done because at this kind of scale, doesn’t make sense to do these
- Anything that’s going to require these requires a query planner that can translate
- To implement these restricted ones .. some will be fast, search on index, some will be a linear search
- Linear scan doesn’t make sense for large database
- Restrictions are there to make sure that your queries will execute quickly
- Loads of concurrent requests are fine. Get results all at once.
- Have to be able to group queries yourself. Good library support is essential. May need to build your own.
- Must do a lot more work on write than you are used to doing at read. De-normalization (oh no)!
- Store lots of information in one data field?
- remember that there won’t be cascading on update
- On caching: Rarely need to use something like memcached since there are no expensive queries. Problem is that writes may not take effect for up to a minute. Cache insertions.
- Tips: concurrency is your friend, get good at grouping/sorting in local memory, compute on updates
- Data management that Viaweb chose: Load machine with RAM. Have _all_ of customer’s data loaded into memory while customer is working on it.
- Issue with this is you can’t load balance the queries, because the live updated data will only be on one server
- Another worry: data loss when you kill -9
- Slowly write changes to disk
- Look up Prevayler, a Java library
- Inspired code in other languages
- Formalizes the pattern of keeping data in memory. Changes get represented in a Command object.
- Serialize the Command objects into a transaction log
- Checkpoint occasionally (write out the whole state of the world)
- Replay the Commands when needed
- Things crash? Minimal data loss due to frequent checkpointing.
- Jotspot used Prevayler, he thinks
- Tips: fulltext index covers a multitude of sins, linear scans can often make up the rest, again get good at grouping/sorting in local memory
- Seems getting good at grouping/sorting in memory is essential
- Techmeme fits into RAM! 600 MB of data. Why bother caching database when you can cram the whole thing into RAM?
- What about: Transactions, load balancing, data size > RAM size?
- Two answers
- Transactions: probably don’t need, try to get rid of concurrency. If you partition finely enough it might have five people per partition, one or two people on at a time. Mutex to execute the command object. Serialize rather than have optimistic concurrency.
- Load balancing: Not that big a deal. Don’t need a customer balanced across many servers, need your customers partitioned over enough servers.
- What happens when someone wants a gig of data? Can’t put that into memory. (Or maybe you can? 16 GB servers are reasonable.)
- Second set of answers
- Demo of MagLev – new product, hasn’t been demoed before
- new Ruby implementation built from the ground up for scalable web applications with persistence and caching built in
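
To make the Prevayler-style notes above concrete (state in memory, changes as Command objects, a transaction log, occasional checkpoints, replay on restart, a mutex instead of transactions), here is a rough Python sketch. It is not Prevayler's API and not what the presenter showed: the `PrevalentStore` class, the `set`/`delete` command format, and the file names are all made up for illustration, and commands are plain dicts rather than real Command objects.

```python
import json
import os
import tempfile
import threading


class PrevalentStore:
    """Toy prevalence layer (hypothetical): state in RAM, commands journaled to disk."""

    VALID_OPS = ("set", "delete")

    def __init__(self, directory):
        self.log_path = os.path.join(directory, "commands.log")
        self.snapshot_path = os.path.join(directory, "snapshot.json")
        self.lock = threading.Lock()  # one command at a time, no optimistic concurrency
        self.state = {}
        self._recover()

    def _apply(self, command):
        # A command is a plain dict describing one change to the in-memory state.
        if command["op"] == "set":
            self.state[command["key"]] = command["value"]
        else:  # "delete"
            self.state.pop(command["key"], None)

    def _recover(self):
        # Load the last checkpoint, then replay the commands logged after it.
        if os.path.exists(self.snapshot_path):
            with open(self.snapshot_path) as f:
                self.state = json.load(f)
        if os.path.exists(self.log_path):
            with open(self.log_path) as f:
                for line in f:
                    self._apply(json.loads(line))

    def execute(self, command):
        if command.get("op") not in self.VALID_OPS:
            raise ValueError("unknown op: %r" % command.get("op"))
        with self.lock:
            # Journal before applying, so an acknowledged change survives a crash.
            with open(self.log_path, "a") as f:
                f.write(json.dumps(command) + "\n")
                f.flush()
                os.fsync(f.fileno())
            self._apply(command)

    def checkpoint(self):
        with self.lock:
            # Write the whole state to a temp file, then swap it in atomically.
            tmp_path = self.snapshot_path + ".tmp"
            with open(tmp_path, "w") as f:
                json.dump(self.state, f)
                f.flush()
                os.fsync(f.fileno())
            os.replace(tmp_path, self.snapshot_path)
            # The log is now redundant. If we crash before truncating it,
            # replaying it over the snapshot is harmless because these
            # commands are idempotent.
            open(self.log_path, "w").close()


directory = tempfile.mkdtemp()
store = PrevalentStore(directory)
store.execute({"op": "set", "key": "cart:42", "value": ["book"]})
store.checkpoint()
store.execute({"op": "set", "key": "cart:43", "value": ["pen"]})
store.execute({"op": "delete", "key": "cart:42"})

# Simulate a restart: a fresh instance rebuilds state from snapshot + log.
recovered = PrevalentStore(directory)
print(recovered.state)
```

Reads never touch disk; they just look at `store.state`. In the partition-finely spirit of the notes, one such store (and one lock) per customer would keep contention low, though that part is not shown here.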






