Caltech Bookkeeping System
About
The Caltech Bookkeeping System (CBS) is a home-grown, deliberately simple yet reasonably versatile tool for keeping track of our group's private samples that cannot be stored in the actual DBS (Dataset Bookkeeping System).
Users have two use cases:
- Register datasets that are simply a directory -- we assume that everything under it is part of the dataset. This should cover about 80% of the cases.
- Register datasets that are actually filesets -- about 20% of the cases.
Extra thoughts:
- Samir suggested recording in the metadata whether a dataset's location is local or remote, since local files can benefit from extra metadata from Hadoop FSCouch. This would let us account for file sizes where applicable.
- Remember, the Computing Team's focus is to support our different clusters. This tool was built on a best-effort basis and will be supported as such. Let's stick to what we need and to basic functionality.
- User input here -- feel free to edit the wiki!
We chose CouchDB as the backend database because of its portability and because every operation is an HTTP request. This means that, besides the CLI, most of the data is visible from a browser, and users can also access the data directly if they wish to write their own scripts.
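As an illustration of what such a user script might look like, here is a minimal Python sketch. It relies only on the standard shape of a CouchDB view result (a JSON object with a "rows" list of key/value entries). The helper names `fetch_view` and `rows_to_dict` and the sample data are hypothetical, and the assumption that a view emits the dataset name as key and the directory as value is ours, not something CBS guarantees.

```python
import json
from urllib.request import urlopen


def fetch_view(url, timeout=10):
    """GET a CouchDB view URL and return the decoded JSON body."""
    with urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def rows_to_dict(view_result):
    """Group the "rows" of a view result into a {key: [values]} dict."""
    grouped = {}
    for row in view_result.get("rows", []):
        key = row["key"]
        # Grouped views may return array keys; lists are not hashable.
        if isinstance(key, list):
            key = tuple(key)
        grouped.setdefault(key, []).append(row["value"])
    return grouped


# Hypothetical response body, shaped like a standard CouchDB view result.
# Assumption: key = dataset name, value = registered directory.
sample = {
    "total_rows": 2,
    "offset": 0,
    "rows": [
        {"id": "a1", "key": "HggHighPtSamirTest2",
         "value": "/store/user/samir/cernphysics/"},
        {"id": "b2", "key": "HggHighPtSamirTest3",
         "value": "/store/user/samir/cernphysics2/"},
    ],
}
print(rows_to_dict(sample))
```

In a real script, `rows_to_dict(fetch_view(url))` would replace the sample, with `url` pointing at one of the CBS views and the script running from within a research network.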
Release plan
- Beta -- Skeleton functionality, but already usable. Registering datasets is fully supported; visualizing the data is not entirely straightforward, though doable. This TWiki documents how to do each operation.
- 0.4 -- All basic functionality for operations involving directories will be available. Should cover 80% of the cases.
- 1.0 -- Full support for filesets. The tool can be considered complete.
- 1.4+ -- Optional release. More advanced functionality, such as reading file sizes from the local HDFS database and reporting dataset sizes to the users.
Beta release usage instructions
What we have available:
- Publication from a terminal on t3-higgs.ultralight.org
- Visualization of the raw data in your browser -- understandable, but the output is not pretty.
- Everything must be done from within a research network (the CERN or Caltech networks will do)
Publishing a dataset
It should be as simple as:
-bash-3.2$ cbs register HggHighPtSamirTest2 /store/user/samir/cernphysics/
ok!
You can also flag your dataset as remote. This will be useful later:
-bash-3.2$ cbs register HggHighPtSamirTest3 /store/user/samir/cernphysics2/ --remote
ok!
Visualizing registered data
- To list all the datasets:
http://nagios.ultralight.org:5984/cbs-beta/_design/cbs/_view/listDatasets?group_level=1
- To list the dataset-name-to-directory association for all datasets:
http://nagios.ultralight.org:5984/cbs-beta/_design/cbs/_view/showAssociation
- To show the information for a single dataset only:
http://nagios.ultralight.org:5984/cbs-beta/_design/cbs/_view/showAssociation?key=%22HggHighPtSamirTest2%22
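Note that the `key` parameter in the last URL is a JSON string: the %22 on each side is a URL-encoded double quote. The sketch below builds the view URLs listed above from Python; the helper name `view_url` is hypothetical and not part of the `cbs` tool.

```python
import json
from urllib.parse import quote, urlencode

# Base URL of the CBS views, taken from the links above.
BASE = "http://nagios.ultralight.org:5984/cbs-beta/_design/cbs/_view"


def view_url(view, key=None, group_level=None):
    """Build the URL for a CBS view, optionally filtered by key."""
    params = {}
    if key is not None:
        # CouchDB expects the key as JSON, so a string needs its quotes.
        params["key"] = json.dumps(key)
    if group_level is not None:
        params["group_level"] = group_level
    query = urlencode(params, quote_via=quote)
    return f"{BASE}/{view}" + (f"?{query}" if query else "")


print(view_url("listDatasets", group_level=1))
print(view_url("showAssociation"))
print(view_url("showAssociation", key="HggHighPtSamirTest2"))
```

The three printed URLs should match the three links above, so the same helper can be reused for any other dataset name.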
-- Main.samir - 2014-05-02
Topic revision: r2 - 2014-05-02 - samir