Data Format Help

Overview

This document provides detailed documentation of the MNV data format used to provide data to the EZChooser client. For a given set of items, the MNV data file defines attributes and values, much as in a relational database or table. It goes beyond basic data definitions by also providing a means of defining presentation and typing information as well as some initialization parameters. Presentation information might include, for example, the image that would be associated with a given item as well as an associated URL link. It is also possible to include typing and presentation information for attributes and values. Parameter initialization in the data file is essentially an alternative to providing these parameter specifications in the applet HTML file, which may be more convenient in some cases.

This data format is an augmentation of the standard CSV (comma-separated values) format, used, for example, in spreadsheet programs such as Microsoft Excel. In fact, EZChooser 1.1 parsers will accept an ordinary CSV file. Such a file will specify the basic data values, and the MNV Compiler program will supply default presentation methods and default initialization parameters.

The enhanced CSV format is called MNV (MultiNaV) format. It encapsulates the ordinary CSV data specification inside a set of <DATA> tags. It then allows for two additional sections for rules and initialization. Thus there are up to three optional sections, appearing in any order, within the file:

  • Data section, enclosed in <DATA> </DATA> tags. The format of this section is the same as a plain CSV file. See the Data Section.
  • Rules Section, enclosed in <RULES> </RULES> tags. The format of this section is described in the Rules Section.
  • Initialization Section, enclosed in <INIT> </INIT> tags. The format of this section is described in the Initialization Section.
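
As a rough illustration of the file layout described above, the following Python sketch splits an MNV file into its sections. It is not the MNV Compiler's code; the function name split_sections and the choice to raise ValueError are assumptions made for this example. Only the tag names and the "plain CSV when untagged" behavior come from this document.

```python
import re

# Tag names taken from the section list above.
SECTION_TAGS = ("DATA", "RULES", "INIT")

def split_sections(text):
    """Return a dict mapping section name to its body text (hypothetical helper)."""
    sections = {}
    for tag in SECTION_TAGS:
        matches = re.findall(rf"<{tag}>(.*?)</{tag}>", text, flags=re.DOTALL)
        if len(matches) > 1:
            raise ValueError(f"more than one {tag} section")
        if f"<{tag}>" in text and not matches:
            raise ValueError(f"{tag} section has no closing tag")
        if matches:
            sections[tag] = matches[0].strip()
    # A file with no tags at all is treated as an ordinary CSV data section.
    if not sections and text.strip():
        sections["DATA"] = text.strip()
    return sections

example = """<RULES>
dimension.1.label.text=Make
</RULES>
<DATA>
Make,Model
Ford,F-250
</DATA>"""

print(split_sections(example))
```

Because each tag is searched for independently, the sections may appear in any order, as the format allows.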

Data Section

The data section of an MNV file must be enclosed with the <DATA> </DATA> tags. The specification within the tags is identical to a standalone CSV file with no tags. As in standard CSV, the fields are separated by commas (',') and new lines indicate a new table row. Those fields that have commas within their value strings should be enclosed within double quotes ("").

EZChooser will interpret the first row as a header row, each field being the column name. Subsequent rows are data rows. Thus each column defines a distinct attribute, a.k.a. feature or dimension, while each row defines an item.

Each row in the data section should have the same number of fields. If a row has a different number of fields, the parser makes no effort to supply the missing fields (for example, by appending blank fields); it issues an error message instead.
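
The header and field-count conventions above can be sketched in Python using the standard csv module. This is only an approximation of what the EZChooser parser does: the function name parse_data_section is hypothetical, and skipping blank lines is an assumption of this sketch.

```python
import csv
import io

def parse_data_section(body):
    """Split a data section into (header, rows), rejecting ragged rows."""
    # csv.reader handles fields that contain commas inside double quotes.
    rows = [row for row in csv.reader(io.StringIO(body)) if row]
    if not rows:
        raise ValueError("data section is empty")
    header, data = rows[0], rows[1:]
    for line_no, row in enumerate(data, start=2):
        if len(row) != len(header):
            # No padding with blank fields: a ragged row is an error.
            raise ValueError(
                f"row {line_no} has {len(row)} fields, expected {len(header)}"
            )
    return header, data

header, data = parse_data_section('Make,Model\nChevrolet,"C/K 2500, 3500"\n')
print(header)
print(data)
```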

Data types

The data type of each dimension may be inferred from the values of that column. Here are the data types supported in EZChooser: string, boolean, integer, float, currency, date. The rules for inferring dimension data types are as follows. They are fired in the specified sequence:

  1. If every string in a column starts with '$', the column is interpreted as currency type.
  2. If every string in a column is one of true, false, on, off, yes, or no, the column is treated as a boolean dimension.
  3. If every string can be parsed as an integer, the column is an integer dimension.
  4. If every string can be parsed as a floating point number, the column is a float dimension.
  5. Otherwise, the dimension defaults to string type.

Note that date type is not inferred, but must be explicitly stated with a rule as described below.

Data typing specified for a dimension in the rules section overrides any type inference described above.
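
The inference sequence above can be sketched as follows. This is an illustration only: trimming surrounding whitespace, matching boolean words case-insensitively, and treating an empty column as string are assumptions of the sketch, and Python's int() and float() accept some spellings (such as "1e5" or "nan") that the actual parser may not.

```python
BOOLEAN_WORDS = {"true", "false", "on", "off", "yes", "no"}

def _parses_as(convert, text):
    """Return True if convert(text) succeeds."""
    try:
        convert(text)
        return True
    except ValueError:
        return False

def infer_type(values):
    """Apply the inference rules in order and return the type name."""
    values = [v.strip() for v in values]  # assumption: whitespace is ignored
    if not values:
        return "string"
    if all(v.startswith("$") for v in values):
        return "currency"
    if all(v.lower() in BOOLEAN_WORDS for v in values):
        return "boolean"
    if all(_parses_as(int, v) for v in values):
        return "integer"
    if all(_parses_as(float, v) for v in values):
        return "float"
    return "string"

print(infer_type([" $15000", " $17000"]))
print(infer_type(["1", "2.5"]))
```

The order matters: a column of integers would also parse as floats, so the integer test has to run first.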

Example data section

Here is a simple example:

<DATA>
 Make,Model,Consumer Guide Recommendation,Class, Price
 Chevrolet,C/K 2500/3500,No recommendation,full-size pickup, $15000
 Chevrolet,Silverado 1500,Best Buy,full-size pickup, $17000
 Ford,F-250/350 Super Duty,Best Buy,full-size pickup, $20000
 Toyota,Tundra,Recommended,full-size pickup, $19000
 </DATA> 

This example is equally valid if no <DATA> tags are included and no other tags are within the file.

Rules Section

The rules section must be enclosed in <RULES> </RULES> tags. The rules section is designed to provide metadata and presentation information for the tabular data in the Data Section. Functions include:
  • Rules for specifying presentation information for Space (the overall data set), dimensions and their units of measurement, items, and values.
  • Dimension and item IDs used for identification in initializing EZChooser.
  • Special text formatting for fields in the data used for parsing and generation (numbers and dates, for instance).
  • Descriptions used in presentation of the 'items', such as glyph representation as well as singular and plural nouns.

Basic rule syntax

The basic syntax for rules associates a key with a value. For example, to associate a dimension's text label to a particular text string, we would write:

dimension.4.label.text=Vehicle Class 

The part to the left of the "=" sign is the key, and the part to the right is the value. A dot notation on the key signifies a hierarchical property decomposition in the usual sense. String values to the right of the "=" do not have to be quoted in general. Note that a rule has to be specified on a single line of the file. We have broken the lines in this document at times for readability only.
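
A minimal Python sketch of this key/value split is shown below. The function name parse_rule is hypothetical. Skipping lines that start with "#" follows the comment style used in the example rule section of this document, and splitting on the first "=" is a simplification that ignores the enclosed assignment form in the BNF.

```python
def parse_rule(line):
    """Return (key_parts, value) for a rule line, or None for blanks and comments."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    key, sep, value = line.partition("=")  # split on the first "=" only
    if not sep:
        raise ValueError("invalid rule: there is no equal sign")
    # The dot notation decomposes the key into a property hierarchy.
    return key.strip().split("."), value.strip()

print(parse_rule("dimension.4.label.text=Vehicle Class "))
```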

Reference to columns

Many rules allow references to a column in the data table. For instance, in the example rule above, the "4" makes reference to column 4 in order to identify a dimension. All column references are "1-based," (the count starts from 1 rather than 0).

Column numbers may also be used in right-hand-sides of rules, and this is typically used for specifying item presentation values. For instance, if you want to specify that the text label for items is to be found in column 2, then you would write a rule like this:

item.label.text=2

If a column reference appears in a rule right-hand-side, then the parser will assume that the referenced column is not normal dimension data, and it will not be presented as feature data in EZChooser. However, if the user explicitly specifies that column as a dimension in the left-hand-side of a rule, as in the example below, it will be treated as a dimension. That is, because it appears in the left-hand-side of a rule, dimension 2 will be shown as an EZChooser row despite the fact that it also appears as a column reference on the right-hand-side of a rule.

item.label.text=2
dimension.2.label.text=3
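
The visibility convention described above can be approximated in Python as follows. This sketch makes several assumptions that are not confirmed by this document: every column is a dimension unless hidden, only bare integers on the right-hand side count as column references, and only item rules and dimension value rules are scanned for them. The function name dimension_columns is hypothetical.

```python
def dimension_columns(num_columns, rules):
    """Return the 1-based columns that would be shown as dimension rows.

    rules is a list of (key, value) string pairs.
    """
    referenced = set()  # columns used on a right-hand side
    explicit = set()    # columns named as dimensions on a left-hand side
    for key, value in rules:
        parts = key.split(".")
        if parts[0] == "dimension":
            explicit.add(int(parts[1]))
            if len(parts) < 3 or parts[2] != "value":
                continue  # right-hand side is treated as literal text
        elif parts[0] != "item":
            continue
        if value.strip().isdigit():
            referenced.add(int(value))
    return [c for c in range(1, num_columns + 1)
            if c in explicit or c not in referenced]

rules = [("item.label.text", "2"), ("item.detail.text", "3")]
print(dimension_columns(4, rules))
print(dimension_columns(4, rules + [("dimension.2.label.text", "Model")]))
```

In the second call, column 2 is shown again because it is named explicitly on the left-hand side of a dimension rule.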

String variables

For right-hand-sides of rules, there is a convention for substituting variables within strings. If a value specification surrounds an integer with carets ("^"), the integer is interpreted as a column reference, and that column's value for the item is substituted into the string. An example is shown here:

item.label.image=carimages/^17^.jpg 

The above rule states that for an item label's image (it takes a URL string as value), substitute the value in column 17 for that item in place of "^17^".
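
The substitution can be pictured with a short Python sketch. It handles only the fully enclosed "^n^" form, and the function name substitute and the sample row are made up for this example.

```python
import re

def substitute(template, row):
    """Replace each ^n^ in template with the value of 1-based column n of row."""
    def lookup(match):
        return row[int(match.group(1)) - 1]
    return re.sub(r"\^(\d+)\^", lookup, template)

row = ["Ford", "F-250", "ford_f250"]  # hypothetical item with three columns
print(substitute("carimages/^3^.jpg", row))
```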

Typing

All dimensions are assigned a type. If a type is not explicitly assigned through a rule, then the type will be inferred as explained in the data section. Rules can assign any of the types string, boolean, integer, float, currency, date as in the following example.

    dimension.7.datatype=integer

Presentation information

The overall space (data set) in EZChooser can have presentation information specified, as can items and dimensions. For items, presentation information is used in the lower half of the screen area where items are listed that match the value restrictions in the dimension screen area (upper half). For dimensions, presentation information is used in the rendering of dimension (feature) rows. Dimension presentation includes the dimension labels, units of measurement, and cell values (text on buttons). In general, "text", "image", and "url" presentations can be specified in both label and detail categories. There are also some other special cases such as icon (glyph) drawings and nouns to use in descriptions of the item sets.

Note that not all fields in the spec are currently being utilized by the EZChooser applet (Version 1.1), although we anticipate changes in the future. In particular, EZChooser Version 1.1 is not rendering information related to spaces or detailed presentations of any kind.

Item presentation

Here is an example of specifying item presentation:

item.label.text=2
item.label.image=carimages/^17^.jpg
item.label.url=
    http://cg.superpages.com/cgi-bin/php/new/reports/full/intro?CarId=^19
item.detail.text=18

In each of the rules above, a column number is referenced in the right-hand-side, which is typical for specifying item presentation.

  • The first line states in what column to find the label presentation for an item. The item labels appear below their images in the bottom half of the screen in Multinav 1.1.
  • The second line is an example of a relative URL spec for images. The complete URL path will be constructed from this relative URL and the host directory on a server from which the applet is retrieved. Note the use of variables. (Column 17 is substituted into this string for each item label URL.)
  • The third line specifies the URL link for the item. In EZChooser Version 1.1, this is the link followed when a user clicks on an image in the bottom half of the screen.
  • The last line specifies the column for detailed text information. As mentioned, detailed presentation information is not currently being rendered in EZChooser 1.1.

Item descriptions

The keyword 'itemdesc' allows additional specification of how to describe an item at given points in the application. Are the items vehicles, digital cameras, or other things? The glyph (icon) definition gives a graphical representation of the objects, so that different icons for vehicles, cameras, etc. can be used. For example, here is how we have specified this extra information for a dataset of cars:

itemdesc.noun.singular=vehicle
itemdesc.noun.plural=vehicles
itemdesc.glyph.points=(1,8) (2,8) (2,9) (3,9) (3,8) (7,8) (7,9) (8,9) 
                      (8,8) (9,8) (9,7) (7,5) (3,5) (1,7)

Dimension presentation

The 'dimension' keyword is used to specify which columns to include as data features in Multinav. The column numbers must refer to existing columns in the data section. Other attributes of a dimension, such as how its values are presented, its units, etc., can also be specified. For example, the next set of rules specifies presentation for a miles-per-gallon dimension:

dimension.7.label.text=City Fuel Efficiency
dimension.7.unit.label.text=Miles/Gallon
dimension.7.label.url=http://hostname.com/aboutMPG.html

The first two lines above should be straightforward; they specify dimension and unit presentations that appear at the left edge of a dimension row. The third line specifies the URL link for that dimension, which will be presented as an underlined link on the dimension text label. This feature is intended to provide a hook for explanatory information about dimensions.

At times it may be important to explicitly specify the presentation of values. Note that values provide the basis of sorting items in each dimension row in EZChooser. Thus it may be important to preserve this underlying value for sorting purposes but still allow a presentation string that is different. An example might be a dimension such as screen resolution. Here the application designer may want the underlying value to be the total number of pixels (an integer) but the presentation to be the string "width X height." If so, you would specify it as in the following example:

    dimension.8.unit.label.text=Screen Resolution
    dimension.8.value.text=9

where column 9 contains entries like "1024X768".

Note that if you do make use of this convention for specifying labeling on dimension values, it is only possible for a single value to have a single text presentation. In other words, you cannot map the same value to different presentation strings even though different items may be involved.
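
The one-value, one-label restriction can be illustrated with a small Python sketch that builds the mapping from underlying values to presentation strings. The function name value_labels and the sample rows are hypothetical, and raising an error on a conflict is a choice made for this sketch; this document does not say how the compiler reacts.

```python
def value_labels(rows, value_col, label_col):
    """Map each underlying value to its single presentation string.

    value_col and label_col are 1-based column numbers.
    """
    labels = {}
    for row in rows:
        value, label = row[value_col - 1], row[label_col - 1]
        # setdefault keeps the first label seen for a value.
        if labels.setdefault(value, label) != label:
            raise ValueError(f"value {value!r} has more than one presentation string")
    return labels

rows = [
    ["Laptop A", "786432", "1024X768"],
    ["Laptop B", "480000", "800X600"],
    ["Laptop C", "786432", "1024X768"],
]
print(value_labels(rows, 2, 3))
```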

Special formatting

Text formats can be specified for both parsing the input file and presenting within EZChooser. The keyword for parsing the input file is "format" and the keyword for presentation within EZChooser is "presentationFormat."

An example of the use of a formatting instruction in rules follows. This rule says that float values in dimension 10 should be presented with one decimal place and a comma every third digit to the left of the decimal point.

dimension.10.datatype=float
dimension.10.presentationFormat=#,##0.0 

Here is another example for dates. This combination of rules says that the format for dates in the MNV file is the numerical slash pattern month/day/year, e.g., 05/02/99. However, the presentation of the value in EZChooser dimension rows should be the year only, with four digits, e.g., 1999.

dimension.9.datatype=date
dimension.9.format=MM/dd/yy
dimension.9.presentationFormat=yyyy 
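
To make the effect of the float and date rules above concrete, here is a Python sketch that produces the same results for these particular inputs. Python's format specifiers are used only as stand-ins: they are not the Java patterns that EZChooser accepts, and the correspondence is an assumption that holds for these simple cases.

```python
from datetime import datetime

# Rough analogue of presentationFormat=#,##0.0 (one decimal place, grouped thousands).
float_one_decimal = format(1234.56, ",.1f")

# Rough analogue of format=MM/dd/yy for parsing and presentationFormat=yyyy for display.
parsed = datetime.strptime("05/02/99", "%m/%d/%y")
year_only = parsed.strftime("%Y")

print(float_one_decimal)  # 1,234.6
print(year_only)          # 1999
```

The point is the separation of concerns: one pattern controls how the text in the data file is read, and a second, independent pattern controls how the value is shown.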


The format patterns that can be specified should conform to standard Java format patterns. This convention is particularly useful for numbers, currency, and dates. Several simple cases are shown below.

  • Floats formatted to one decimal place and to no decimal places, respectively. The Java pattern language being used here is somewhat obscure. "0" indicates a required digit. "#" indicates an optional digit. It's easiest if you just follow these examples for the basic cases.
float number 1234.56 <#,##0.0> --> 1,234.6
             1234.56 <#0>      --> 1235

  • Dates showing the time of day, and the day, month, and year, respectively.
date field "h:mm a"            --> 12:08 PM
           "EEE, MMM d, ''yy"  --> Wed, Jul 10, '96

IDs

Items and dimensions can have IDs specified. IDs are used to help initialization of the Multinav Navigator, so that in either the data file or in applet parameters, the user can easily specify which dimensions are to be displayed and in what order and which items are to be marked initially. For example, you may want to assign an ID to a dimension as follows:
dimension.6.ID=high price attribute

Once an ID is assigned, you must refer to this attribute via the string "high price attribute". One place this is commonly used is in the initialization section, where one specifies the order in which dimensions are presented. Dimensions may be referred to by position if no ID is specified.

Items have their IDs specified via a column reference. A typical example would be

item.ID=7

which would indicate that each item's ID is in the cells of column 7.

Clustering of values

It is often advantageous to have the compiler aggregate or cluster values within dimensions. As data sets get larger, this gets more and more critical--a user could, for instance, be presented with just 7 buttons for value ranges that could stem from 100 different values in the original data. The algorithms for aggregation are type specific.

  • For 'string' dimensions, you will get the specified number of groups, each of which has (approximately) the same number of values.
  • For 'number' dimensions, the compiler uses the weighted mean to calculate the groups, attempting a bottom-up best-first method to find the best clustering.
  • For 'date' dimensions, the representation is standardized to milliseconds and then a variant of the numeric clustering method is used.

Of the three types of clustering supported, we imagine that string-based clustering may be less useful than numeric and date types. Here is an example of how to invoke clustering. You specify the number of clusters you would like for individual dimensions. (7, plus or minus 2, seems to be a good target for number of clusters.)

dimension.9.NumberOfClusters=7

A good tip is to make use of the special formatting in combination with clustering. The compiler will respect the formatting instructions when it creates the strings representing value ranges. For instance, if you want a date dimension to just include a two-digit year with apostrophe when it prints out values, you may use an instruction such as this:

dimension.9.datatype=date
dimension.9.presentationFormat=''yy 
dimension.9.NumberOfClusters=7

The resulting presentation on a value button would be something like this:

'77-'89   
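
For the 'string' case, the equal-sized grouping described earlier in this section can be sketched in Python as shown below. This is a guess at the general idea rather than the compiler's algorithm: grouping distinct values in sorted order, and giving the earlier groups the extra values when the split is uneven, are both assumptions of the sketch. The numeric and date methods are not reproduced here.

```python
def cluster_strings(values, num_clusters):
    """Split the distinct values into groups of approximately equal size."""
    distinct = sorted(set(values))
    if not distinct:
        return []
    num_clusters = min(num_clusters, len(distinct))
    base, extra = divmod(len(distinct), num_clusters)
    groups, start = [], 0
    for i in range(num_clusters):
        size = base + (1 if i < extra else 0)  # first groups absorb the remainder
        groups.append(distinct[start:start + size])
        start += size
    return groups

print(cluster_strings(list("abcdefg"), 3))
```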

Complete BNF spec

The complete BNF specification for the rules section follows:

rules := (rule)*
rule := ruleKey, assignmentOperator, ruleValue, lineSeparator

ruleKey := spaceKey | dimensionKey | itemKey | itemDescriptionKey

spaceKey := "space".classifiedPresentation
dimensionKey := "dimension".columnNumber.("ID" | "datatype" | "format" | 
                "presentationFormat" | "ignore" | "NumberOfClusters" | 
                classifiedPresentation | unitPresentation | valuePresentation)
itemKey := "item".("ID" | classifiedPresentation)
itemDescriptionKey := "itemdesc".(nouns | glyph)

unitPresentation := "unit".classifiedPresentation
classifiedPresentation := labelPresentation | detailPresentation
labelPresentation := "label".presentationType
detailPresentation := "detail".presentationType
valuePresentation := "value".presentationType
presentationType := "text" | "image" | "url"
nouns := "singular" | "plural"
glyph := (point, separator, point, separator, point)+ , (separator, point)*
point := x, separator, y
separator := "," | " "

assignmentOperator := "=" | enclosedAssign
enclosedAssign := ""=""

ruleValue := fixedValue | variableValue | dataTypeKeys |
             Dimension-specific information
fixedValue := arbitraryText | formatSpecification | urlString
variableValue := (prefix ^)? columnNumber (^ suffix)?
dataTypeKeys := "string" | "boolean" | "integer" | "float" | 
                "currency" | "date"

The matching between keys and values should be evident from the names; for example, a datatype key should only have one of the dataTypeKeys as its value. Here are some points that cannot be seen from the grammar above.

  • variableValue can be specified only for itemKey, valuePresentation and item ID; elsewhere it is not parsed and is treated as a literal string. Literal strings can be used for prefix and suffix. Here are several special cases to note when specifying variableValue:
    • To use a constant string as the value for itemKey, valuePresentation or item ID, enclose it inside double quotes ("").
    • When two terms are specified and both happen to be numbers, the real column number should be enclosed in parentheses ("()"); the other number is then treated as prefix or suffix automatically.
  • To eliminate ambiguity, enclosedAssign can be used when there are other equal signs ("=") in the rule.

 

Example rule section

Here are some examples of rules with accompanying comments:

<RULES>
#The following two rules specify text labels for columns 4 and 5
dimension.4.label.text=Vehicle Class
dimension.5.label.text=Base Price
#An example of type specification
dimension.5.datatype=currency
#ID specifications can be any text string (they default to column number)
dimension.5.ID=low price attribute
#Specifies the text for labeling the dimension's unit of measurement
dimension.5.unit.label.text=Dollars
#Next set of rules is typical of a currency dimension
dimension.6.label.text=Loaded Price
dimension.6.datatype=currency
dimension.6.ID=high price attribute
dimension.6.unit.label.text=Dollars
dimension.6.format=0
#Next set of rules is typical of an integer-valued dimension
dimension.7.datatype=integer
dimension.7.ID=City Milage
dimension.7.label.text=City Fuel Efficiency
dimension.7.unit.label.text=Miles/Gallon
#Next set of rules is an example of a float-valued dimension
dimension.10.unit.label.text=inches
dimension.10.datatype=float
dimension.10.presentationFormat=#,##0.0
#This url spec specifies a link associated with the text label
dimension.14.value.url=14
#Following are specifications typical for items. Note the use of variables.
item.label.text=2
item.label.image=carimages/^17^.jpg
item.detail.text=18
item.label.url=http://cg.superpages.com/cgi-bin/php/new/reports/full/intro?CarId=^19
item.ID=19
#These rules specify values to use in item descriptions
itemdesc.noun.singular=vehicle
itemdesc.noun.plural=vehicles
itemdesc.glyph.points=(1,8) (2,8) (2,9) (3,9) (3,8) (7,8) (7,9) (8,9) (8,8) (9,8) (9,7) (7,5) (3,5) (1,7)
</RULES>

Error checking

Errors being checked

  • Non-positive or non-numerical column number
  • More than one data section
  • More than one rules section
  • More than one initialization section
  • A section does not have a closing tag before end of file
  • Data Section is empty, Space can not be generated
  • Invalid rule: not a valid key=value pair
  • Invalid rule: There is no equal sign
  • Invalid rule: There are multiple equal signs. Use "=" to make it explicit
  • Invalid keyword: keyword does not match
  • Invalid keyword: extra fields in the key
  • Invalid keyword: does not have all the required terms for this rule
  • Invalid rule for variable (row dependent) presentation
  • Ambiguity when two numbers are specified, use () to designate the real column
  • No valid column number can be parsed from the two terms for the variable (row dependent) presentation
  • There is no such column in the data, column number too large

Errors not being checked, but should be

  • invalid data type specification
  • glyph definition checking
  • singular noun must be specified if plural is present

Initialization Section

The following initialization parameters can optionally be specified in the data file, as applet parameters, or both. A value specified through applet parameters overrides the one in the data file:

  • SelectedDimensionsInOrder = comma separated dimension IDs that are to be displayed.
  • MarkedItems = (item ID, color pattern), (item ID, color pattern) ...

If a parameter is not specified in either place, the default is:
  • SelectedDimensionsInOrder: all dimensions will be displayed in random order.
  • MarkedItems: no initial markers when applet gets loaded.
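
The parameter formats and the override rule above can be sketched in Python as follows. All three function names are hypothetical, the color patterns "red" and "blue" are placeholders (this document does not define the color pattern syntax), and the sketch assumes that IDs contain no commas or parentheses.

```python
import re

def parse_selected_dimensions(spec):
    """Split a comma separated list of dimension IDs, preserving order."""
    return [part.strip() for part in spec.split(",") if part.strip()]

def parse_marked_items(spec):
    """Parse '(item ID, color pattern), ...' into a list of pairs."""
    pairs = re.findall(r"\(([^,()]*),([^()]*)\)", spec)
    return [(item_id.strip(), pattern.strip()) for item_id, pattern in pairs]

def effective_params(file_params, applet_params):
    """Applet parameters take precedence over those in the data file."""
    merged = dict(file_params)
    merged.update(applet_params)
    return merged

from_file = {"SelectedDimensionsInOrder": "Dimension 1, low price attribute",
             "MarkedItems": "(12, red)"}
from_applet = {"MarkedItems": "(19, blue)"}
params = effective_params(from_file, from_applet)
print(parse_selected_dimensions(params["SelectedDimensionsInOrder"]))
print(parse_marked_items(params["MarkedItems"]))
```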

Example initialization section

The following example mixes the conventions for dimension reference. Those dimensions that have been named, such as "low price attribute," are referred to as such. Where dimensions have not been named, they are referred to by position. There should be no line-feeds in such a spec.

<INIT>
 SelectedDimensionsInOrder=Dimension 1, low price attribute, high price 
 attribute, City Milage, Highway Milage, Dimension 3, Dimension 9, 
 Dimension Dimension 11, Dimension 12, Dimension 13, Dimension 15, Dimension 16 
</INIT>


Copyright © 2000 Verizon Laboratories Inc. All rights reserved.