Prepare Data

Loading Data into CAT

You can upload and code a “raw dataset” from a file with text formatted in one of three ways:

  1. a plain text file (.txt),
  2. a zip archive of plain text files (.zip), or
  3. an XML file (.xml)

Data Preparation When Using a Single Plain Text File

CAT relies on predefined spans of text to auto-load discrete items, which we call "codeable units," during the coding process. Unless you use one of two special delimiters, the system assumes you want to apply your codes to the entire document. If you upload a single plain text file, the coding system presents the codeable units one at a time, each consisting of the text between a pair of blank lines. The blank line is the delimiter. Please note: the blank line delimiter applies only to a single text file dataset.

When uploading a single .txt file as your raw data, prepare it as follows:

<text to be coded><hard return>
<hard return>
<text to be coded><hard return>
<hard return>
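
Before uploading, it can help to preview how a single .txt file will break into codeable units. The following Python sketch is not part of CAT; it only approximates the blank line rule described above. The function name split_codeable_units is hypothetical, and treating whitespace-only lines as blank is an assumption about how CAT reads the file.

```python
import re


def split_codeable_units(raw_text: str) -> list[str]:
    """Split raw text into units separated by one or more blank lines."""
    # Normalize Windows (\r\n) and old Mac (\r) line endings to \n.
    normalized = raw_text.replace("\r\n", "\n").replace("\r", "\n")
    # Assumption: a line holding only spaces or tabs counts as a blank line.
    chunks = re.split(r"\n\s*\n", normalized)
    # Drop empty chunks left over from leading or trailing blank lines.
    return [chunk.strip() for chunk in chunks if chunk.strip()]


if __name__ == "__main__":
    sample = "First response.\n\nSecond response,\non two lines.\n\n"
    for number, unit in enumerate(split_codeable_units(sample), start=1):
        print(f"Unit {number}: {unit!r}")
```

If the number of units printed differs from what you expect, check the file for missing or extra blank lines before uploading.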
        

Data Preparation When Using a .zip Archive of Plain Text Files

If you upload a collection of plain text files in a .zip archive, the system assumes that each document is a "codeable unit." You can, however, insert a special delimiter in your raw data:

This delimiter allows you to upload a .zip archive of two or more files and still code at the sub-document level, rather than at the whole-document level. As with the single text file, the span of text to be coded is up to you (e.g., a sentence, a question/answer pair, a paragraph, a speaker in a focus group, etc.). The delimiter has to be on a line all by itself - that is, you need to have:

Note: Be sure to save your raw data files as “plain text” (.txt) files.

Data Preparation Using an XML File

When you upload an XML file, the system verifies that the file conforms to the required schema definition and processes it accordingly. The schema may be found at: http://cat.texifter.com/resources/codeupload.xsd
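
A quick local check can catch basic XML errors before you upload. The Python sketch below only tests that a document is well-formed (tags properly opened, nested, and closed); it does not validate against the CAT schema, which the system does on upload. The function name is_well_formed and the element names in the sample strings are placeholders, not names taken from the schema.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if xml_text parses as XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        # Raised for unclosed tags, mismatched tags, stray text, and so on.
        return False
    return True


if __name__ == "__main__":
    # Placeholder element names, used only to exercise the check.
    good = "<root><item>text to be coded</item></root>"
    bad = "<root><item>text to be coded</root>"
    print(is_well_formed(good))
    print(is_well_formed(bad))
```

A file that passes this check can still be rejected on upload if it does not follow the schema.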

A sample XML document and tips on using the XML upload functionality can be found at: http://cat.ucsur.pitt.edu/resources/codeupload.xsd

Codeable units can be defined in ATLAS.ti as "free quotations" and loaded into CAT for coding via XML export. This is useful when coding at the sentence or multi-sentence level, since the CAT interface reduces errors and decreases coding time relative to ATLAS.ti.

The following procedure illustrates how to export all of the quotations from an ATLAS.ti hermeneutic unit (HU) as codeable units in CAT.