1. Intro

1.1. General Data Manipulation Features

Anatella offers you the following capabilities:

Anatella is 100% Unicode compliant and will accept any character set (Chinese, Cyrillic, Japanese, etc.) without losing any information.

Classical ETL features:
o Join tables,
o Column & row filtering out of tables,
o Sorting,
o Format conversions (CSV, SQL, ...),
o Automatic scoring (using the TIMi predictive models),
o Automatic segmentation (using the Stardust segmentation models),
o Derivation of new columns for predictive modeling,
o Automatic generation of hundreds of thousands of new, “derivate” columns,
o A full scripting language based on JavaScript (standard ECMA-262) that allows you to express the most complex transformations, validations, aggregations and derivations. The small Anatella-specific extensions to the “standard” JavaScript language are easy to use. Furthermore, thanks to these extensions, the JavaScript code becomes similar to (but more versatile than) a “SAS datastep”, so that you can even leverage your SAS skills.
o Complete meta-data extraction & management.

Inside Anatella, most of the data transformation operators are “meta-data free”. This means that it is not necessary to define meta-data to use 99% of the transformations available in Anatella. In this regard, Anatella is very much like MS Excel: inside Excel, you don’t need to specify “by hand” the data type of each of your columns or cells. Excel automatically finds the right data types, so that the equations inside your sheets work without you spending time defining any meta-data. The same principle applies to Anatella.

The “meta-data free” functionality makes Anatella completely different from all the other ETL tools currently on the market. It also greatly simplifies the usage of Anatella: using Anatella is only slightly more complex than using Excel. This means that business users without technical training are usually able to use Anatella without too many headaches.

The “meta-data free” functionality is also important because, in predictive datamining, it is very common to manipulate tables with tens of thousands of columns, and it is impossible to specify “by hand” the meta-data of all these columns (as required by nearly all other ETL software).

Anatella features two highly optimized data file formats with the extensions “.gel_anatella” (row-based file format) and “.cgel_anatella” (column-based file format). These two file formats are primarily optimized for speed. One “.gel_anatella” file (or one “.cgel_anatella” file) contains one data table (and all its meta-data).

Usually, you can read a “.gel_anatella” file at a throughput of 70 MB/sec (on common hardware). The data inside a simple “.gel_anatella” file is compressed by a factor of around 4. This means that reading a table out of a “.gel_anatella” file at a speed of 70 MB/sec (compressed) is actually equivalent to reading the same table at a speed of 4 x 70 = 280 MB/sec (uncompressed).

A very common situation in business intelligence is to compute aggregates based on a subset of the columns available inside the data file. Aggregates are typically computed using only about 4 columns out of the many columns stored inside the data file. The columnar file format “.cgel_anatella” allows you to extract from the hard drive just the few bytes that make up the 4 columns required to compute the aggregation (and to avoid reading & decompressing the bytes that belong to all the other columns). In practice, this means that, for most business intelligence tasks, the “.cgel_anatella” files are usually five to a hundred times faster than the simpler “.gel_anatella” files, simply because you avoid reading the columns that you don’t need (at the cost of a slightly higher RAM consumption). These performances place Anatella ahead of most competing offerings (as evidenced by its score on the TPC-H benchmark).

Anatella can read, in “native mode”, compressed dataset files in text format (CSV).
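
To make the idea of reading a compressed text file “in native mode” more concrete, here is a minimal sketch in Python. It is not Anatella code and does not describe how Anatella is implemented; it only illustrates the general technique of decompressing a CSV stream on the fly, row by row, without first writing an uncompressed copy to disk. The helper names (write_gzip_csv, stream_rows) and the sample rows are hypothetical, and only the gzip format is shown.

```python
import csv
import gzip
import io


def write_gzip_csv(rows):
    """Return an in-memory gzip-compressed CSV (stands in for a file on disk)."""
    text = io.StringIO()
    csv.writer(text).writerows(rows)
    return gzip.compress(text.getvalue().encode("utf-8"))


def stream_rows(compressed):
    """Yield CSV rows one at a time, decompressing on the fly."""
    source = io.BytesIO(compressed)
    with gzip.open(source, mode="rt", encoding="utf-8", newline="") as handle:
        # csv.reader pulls text from the gzip stream lazily, so the
        # uncompressed file never has to exist as a whole on disk.
        yield from csv.reader(handle)


# Hypothetical sample data, for illustration only.
sample = [["age", "income"], ["39", "<=50K"], ["50", ">50K"]]
blob = write_gzip_csv(sample)
rows = list(stream_rows(blob))
print(rows == sample)
```

In a real setting the io.BytesIO object would be replaced by a path to a compressed file on disk; the reading loop stays the same.
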
The supported compression formats are RAR (unique!), ZIP, GZ, Z and LZO. This functionality reduces the need for hard drive space on your server. Let’s give an example:
o The open-source “Census-Income” database stored in a RAR-compressed text file “weighs” 4.04 MB.
o The same database in an uncompressed text file “weighs” 96 MB (let’s say around 100 MB).
o The same database in a classical SAS “.sas7bdat” dataset file “weighs” around 250 MB.

The numbers given in this example are quite common and perfectly illustrate why an ETL tool should be able to process compressed data streams (even more so when dealing with “big data”).

Anatella natively reads SAS dataset files (.sas7bdat files), SPSS dataset files (.sav and .por files) and Stata dataset files (.dta files).

Anatella is heavily multi-threaded. This means that a single data transformation running in Anatella can exploit all the CPUs inside your server to decrease the computation time. Classical ETL tools are only able to run different data transformations on different CPUs, but not one particular data transformation on many CPUs. The multi-threading capabilities of Anatella usually divide the computing time of a data transformation graph by a factor of between 4 and 10.

Anatella offers you direct access to all the “classical” relational databases via ODBC & OLE DB connectors (Oracle, SQL Server, MySQL, Teradata, ...).

Anatella provides some crude OLAP reporting functionalities through the use of a “Microsoft Office data injection operator”. This operator allows you to automatically inject, “in batch”, data extracted from the Anatella graph into any chart or graphic contained in any Microsoft Office document. For example, you can obtain, in a few mouse clicks, an automatic daily update of all the charts of your preferred PowerPoint presentation. Anatella can generate all types of MS Office graphs: pie chart, 3D surface chart, bar chart, doughnut chart, bubble chart, etc.
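
The benefit of the column-based layout described earlier in this section (reading only the few columns that an aggregate needs) can be illustrated with a toy sketch in Python. This is not Anatella code and makes no claim about Anatella's actual file formats: the table contents, the function names and the use of zlib as the compressor are all assumptions chosen for illustration. The sketch stores the same table once row by row and once column by column, then counts how many compressed bytes must be read to obtain 4 columns out of 20.

```python
import zlib


def build_table(n_rows=1000, n_cols=20):
    """Build a toy table as a list of rows; each cell is a short string."""
    return [[f"r{r}c{c}" for c in range(n_cols)] for r in range(n_rows)]


def encode_row_based(table):
    """Row-based layout: all rows are serialised in full into one compressed blob."""
    text = "\n".join(",".join(row) for row in table)
    return zlib.compress(text.encode("utf-8"))


def encode_column_based(table):
    """Column-based layout: each column is compressed into its own blob."""
    n_cols = len(table[0])
    return [
        zlib.compress("\n".join(row[c] for row in table).encode("utf-8"))
        for c in range(n_cols)
    ]


def read_columns_row_based(blob, wanted):
    """The whole blob must be read and decompressed to extract any column."""
    rows = [line.split(",") for line in zlib.decompress(blob).decode("utf-8").split("\n")]
    columns = [[row[c] for row in rows] for c in wanted]
    return columns, len(blob)


def read_columns_column_based(blobs, wanted):
    """Only the blobs of the wanted columns are read and decompressed."""
    columns = [zlib.decompress(blobs[c]).decode("utf-8").split("\n") for c in wanted]
    bytes_read = sum(len(blobs[c]) for c in wanted)
    return columns, bytes_read


table = build_table()
wanted = [0, 5, 10, 15]  # the 4 columns needed by a hypothetical aggregate

row_cols, row_bytes = read_columns_row_based(encode_row_based(table), wanted)
col_cols, col_bytes = read_columns_column_based(encode_column_based(table), wanted)

print("bytes read, row-based layout:   ", row_bytes)
print("bytes read, column-based layout:", col_bytes)
```

Both readers return the same column values; the difference is only in how many bytes have to be fetched and decompressed. The ratio obtained on this toy data says nothing about real-world speed-ups, which depend on the data, the hardware and the actual file format.
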