## Tuesday, April 30, 2013

### Data Frames for GSL Shell

Lately I've done a lot of work to implement "General Data Tables" in GSL Shell. I chose this name to designate what is otherwise called a DataFrame in GNU R or other environments.

The differences between data tables and matrices are:
• each column is identified by a name
• the data in each cell can be a number, but also a string or undefined

Being able to store strings in the cells is very useful, since not all data is numeric.

In addition, the fact that each column has a name greatly simplifies a lot of operations, since you can refer to the data by name instead of dealing with anonymous columns identified by an index.
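To make the two properties listed above concrete, here is a minimal sketch of a column-oriented table with named columns and mixed cell types. It is plain Python, not GSL Shell code, and the `Table` class with its `get` method is a hypothetical illustration, not the gdt implementation. The third row is made up only to show an undefined cell.

```python
# A toy column-oriented table: NOT the gdt implementation, just the idea.
# Cells may hold a number, a string, or None (undefined).

class Table:
    def __init__(self, headers, rows):
        self.headers = list(headers)
        # Store one list per named column.
        self.columns = {name: [] for name in self.headers}
        for row in rows:
            if len(row) != len(self.headers):
                raise ValueError("row length does not match the number of columns")
            for name, value in zip(self.headers, row):
                self.columns[name].append(value)

    def __len__(self):
        # Number of rows.
        return len(self.columns[self.headers[0]]) if self.headers else 0

    def get(self, row, name):
        """Return the cell at a 0-based row index in the column called `name`."""
        return self.columns[name][row]


t = Table(
    ["student", "teacher", "sex", "final"],
    [
        ("S-1", "john", "male", 58),
        ("S-2", "john", "male", 91),
        ("S-x", "jane", "female", None),  # hypothetical row with an undefined cell
    ],
)

print(t.get(1, "final"))    # access by column name rather than by index
print(t.get(0, "teacher"))  # a string cell
print(t.get(2, "final"))    # an undefined cell
```

Storing the data by column keeps each named column together, which is convenient when an operation only needs one or two of them.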

Here is an example taken from the excellent |STAT user manual by Gary Perlman.

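| student | teacher | sex | m1 | m2 | final |
|---|---|---|---|---|---|
| S-1 | john | male | 56 | 42 | 58 |
| S-2 | john | male | 96 | 90 | 91 |
| S-3 | john | male | 70 | 59 | 65 |
| S-4 | john | male | 82 | 75 | 78 |
| S-5 | john | male | 85 | 90 | 92 |
| S-6 | john | male | 69 | 60 | 65 |
| S-7 | john | female | 82 | 78 | 60 |
| S-8 | john | female | 84 | 81 | 82 |
| S-9 | john | female | 89 | 80 | 68 |
| S-10 | john | female | 90 | 93 | 91 |
| S-11 | jane | male | 42 | 46 | 65 |
| S-12 | jane | male | 28 | 15 | 34 |
| S-13 | jane | male | 49 | 68 | 75 |
| S-14 | jane | male | 36 | 30 | 48 |
| S-15 | jane | male | 58 | 58 | 62 |
| S-16 | jane | male | 72 | 70 | 84 |
| S-17 | jane | female | 65 | 61 | 70 |
| S-18 | jane | female | 68 | 75 | 71 |
| S-19 | jane | female | 62 | 50 | 55 |
| S-20 | jane | female | 71 | 72 | 87 |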

The data above can be used to show some of the plotting functions. In the examples below, the table is assumed to be stored in the variable "ms".

What is very interesting is that, with the data in tabular format, many operations become very easy. For example, to create a histogram of the "final" column you can simply type:
> gdt.hist(ms, "final")

to obtain the following plot:
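As a rough idea of what such a histogram summarizes, the sketch below counts the "final" values of the table above in bins of width 10. It is plain Python, not GSL Shell code, and the bin width and the `histogram` helper are my own assumptions for illustration; "gdt.hist" chooses its own bins.

```python
from collections import Counter

# "final" column of the table above, in row order S-1 .. S-20.
final = [58, 91, 65, 78, 92, 65, 60, 82, 68, 91,
         65, 34, 75, 48, 62, 84, 70, 71, 55, 87]

BIN_WIDTH = 10  # arbitrary choice for this sketch


def histogram(values, width):
    """Count how many values fall in each bin [k*width, (k+1)*width)."""
    counts = Counter((v // width) * width for v in values)
    return dict(sorted(counts.items()))


# Print a crude text histogram, one line per non-empty bin.
for low, count in histogram(final, BIN_WIDTH).items():
    print(f"{low:3d}-{low + BIN_WIDTH - 1:3d} {'#' * count}")
```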

Given the data above, you may wish to have a more expressive plot based on the teacher and the sex of the students. This is where the "gdt.plot" function, which I'm very proud of, comes in. You can use it very simply:
> gdt.plot(ms, "final ~ teacher, sex, student")

to obtain the following plot:

The function "gdt.plot" uses a sort of mini language that lets you specify what should be plotted (the y variables) in terms of which variables.

Something interesting is that the function figures out by itself whether the x variable is numeric or an enumeration, as in the example above. In addition, you can "layer up" several enumeration variables, just like you can do with Excel's pivot tables.
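The two ideas in this paragraph can be sketched as follows. This is plain Python, not GSL Shell code, and the rule used here (a column is numeric only if all its defined cells are numbers) is an assumption of mine; the post does not say how "gdt.plot" actually decides. The helper names `is_numeric` and `layered_labels` are hypothetical. The sample values come from rows S-1, S-7, S-11 and S-17 of the table above.

```python
def is_numeric(column):
    """Treat a column as numeric only if every defined cell is a number."""
    defined = [v for v in column if v is not None]
    return bool(defined) and all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in defined
    )


def layered_labels(*columns):
    """Combine several enumeration columns into one tuple label per row."""
    return list(zip(*columns))


# Values taken from rows S-1, S-7, S-11 and S-17 of the table above.
teacher = ["john", "john", "jane", "jane"]
sex = ["male", "female", "male", "female"]
m1 = [56, 82, 42, 65]

print(is_numeric(m1))        # a numeric column
print(is_numeric(teacher))   # an enumeration column
print(layered_labels(teacher, sex))
```

Each tuple plays the role of one position on a layered x axis: the first element is the outer level and the second the inner one, much like nested row fields in a pivot table.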

The following form can also be used:

> gdt.plot(ms, "final ~ sex, student | teacher")

to create multiple lines, grouped by the "teacher" field.
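To illustrate how a specification of this form can be read, here is a minimal sketch that splits it into the y variable, the x variables and the grouping variables, then collects one series per group. It is plain Python, not GSL Shell code. It handles plain variable names only, and `parse_spec` and `group_series` are hypothetical helpers, not part of gdt. The sample rows are S-1, S-7, S-11 and S-17 from the table above.

```python
def parse_spec(spec):
    """Split 'y ~ x1, x2 | g1, g2' into (y, [x...], [g...]). Plain names only."""
    y_part, rest = spec.split("~", 1)
    x_part, _, group_part = rest.partition("|")

    def names(text):
        return [n.strip() for n in text.split(",") if n.strip()]

    return y_part.strip(), names(x_part), names(group_part)


def group_series(rows, spec):
    """Return {group_key: [(x_label, y_value), ...]}, one entry per line to draw."""
    y, xs, groups = parse_spec(spec)
    series = {}
    for row in rows:
        key = tuple(row[g] for g in groups)
        label = tuple(row[x] for x in xs)
        series.setdefault(key, []).append((label, row[y]))
    return series


# Rows S-1, S-7, S-11 and S-17 of the table above.
rows = [
    {"student": "S-1", "teacher": "john", "sex": "male", "final": 58},
    {"student": "S-7", "teacher": "john", "sex": "female", "final": 60},
    {"student": "S-11", "teacher": "jane", "sex": "male", "final": 65},
    {"student": "S-17", "teacher": "jane", "sex": "female", "final": 70},
]

for key, points in group_series(rows, "final ~ sex, student | teacher").items():
    print(key, points)
```

With a "| teacher" part there is one series per teacher; without it, all the rows end up in a single series with an empty key.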

The mini language is actually quite flexible. You can use arbitrary mathematical expressions, not just variable names. If you want, you can explore its possibilities yourself: there is a dedicated chapter in the GSL Shell user manual.

I hope you find this interesting. In the next post I will talk about the linear regression function, modelled after GNU R's "lm" function...