Constructional Profiling

The Russian verb грузить/gruzit’ ‘load’ is special for three reasons. First, it can appear in two syntactic constructions; second, it has three Natural Perfectives; and third, all three Natural Perfectives can also appear in both constructions.

The two constructions that грузить/gruzit’ ‘load’ can appear in are called the “theme-object” construction and the “goal-object” construction. The names of the constructions come from the direct object that is marked with the accusative case. Let’s say that we have some boxes that we want to transport and a cart that we can use for this purpose. The boxes are the theme (the item that is put somewhere) and the cart is the goal (the place where the item is put). In the theme-object construction the theme is the direct object, as in грузить/gruzit’ ящики на телегу ‘load the boxes onto the cart’. The goal appears in a prepositional phrase in the theme-object construction, usually with the preposition на ‘onto’ or в ‘into’. In the goal-object construction the goal is the direct object, as in грузить/gruzit’ телегу ящиками ‘load the cart with boxes’. The theme in the goal-object construction often appears in the instrumental case, as in our example: ящиками ‘with boxes’.

Грузить/gruzit’ ‘load’ uses not just one, but three prefixes to form Natural Perfectives: na-, za-, and po-. Collectively we call these four verbs (the simplex and the three Natural Perfectives) “the ‘load’ verbs”. All three Natural Perfectives can appear in both the theme-object and the goal-object constructions. Chapter 4 explores whether the choice of prefix makes a difference in the distribution of the theme-object and goal-object constructions.

Statistical analysis

This website gives you access to the data and the statistical analyses that are reported in Chapter 4. This way you can both inspect the data and run the analyses yourself on your own computer. We will guide you through the steps and provide some commentary, but we cannot provide a comprehensive introduction to logistic regression here. If you are interested in learning more about using statistical analysis of linguistic data, we recommend that you consult the following books:

Baayen, R. Harald. 2008. Analyzing Linguistic Data: A Practical Introduction to Statistics Using R. Cambridge: Cambridge UP.
Cohen, Jacob, Patricia Cohen, Stephen G. West, and Leona S. Aiken. 2003. Applied Multiple Regression/Correlation Analysis for the Behavioral Sciences, 3rd ed. Mahwah, NJ: Lawrence Erlbaum.
Gries, Stefan Th. 2009. Statistics for Linguistics with R. Berlin: Mouton de Gruyter.
Johnson, Keith. 2008. Quantitative Methods in Linguistics. Malden, MA: Blackwell.

How to download R

You can download the R statistical software package to your computer from the R project webpage. You will also need to install three R packages called "languageR", "Hmisc" and "rms". You can do that by entering a line in R that looks like this (the ">" is R's prompt and is not typed): > install.packages(c("languageR","Hmisc","rms"), repos = "http://cran.r-project.org")
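If you would rather install only the packages that are not yet on your computer, a small check like the following sketch may help. The helper name missing_packages is our own invention for illustration and is not part of Ch4.R. The install step is wrapped in interactive() so that it only runs when you are working at the R prompt.

```r
# Packages named in the instructions above.
needed <- c("languageR", "Hmisc", "rms")

# Hypothetical helper (not part of Ch4.R): returns the packages in `pkgs`
# that cannot be loaded on this computer.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(needed)

# Install only what is missing, and only in an interactive session.
if (interactive() && length(to_install) > 0) {
  install.packages(to_install, repos = "https://cran.r-project.org")
}
```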

How to download and run the files from this website

On this webpage we offer you two types of files that you can download to your computer. Here they are: Ch4.R, Ch4data.csv. You can download these files by right-clicking on the links on this page. One of the files has the ".R" extension. This is an "R script". The R script contains all the commands that R needs in order to run the statistical test. You can open the R script if you like and see all the commands. We provide commentary on each command in lines that begin with the "#" symbol (R itself ignores all these lines) in order to help you follow along. The other file has the ".csv" extension and contains a dataset. The R script performs the statistical analysis on this dataset. If you want to look at the dataset, you can open it with Microsoft Excel. It is important that you download both of these files to your home directory in your computer so that R can find them. If you do not know where your home directory is, you can also copy and paste the R commands from the scripts directly into the R window (see "Alternative methods for running R scripts" below). However, you will need to tell R where to find the .csv file with the dataset, by giving it the correct path, and you will have to put this into the R code.
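If you are unsure whether R will find the two files, you can ask R where it is looking. The sketch below assumes that the script refers to the files by their bare names, in which case R looks for them in its current working directory. That directory is often, but not always, your home directory.

```r
# Folder in which R currently looks for files given without a path.
print(getwd())

# Your home directory, as R understands it.
print(path.expand("~"))

# Are the two downloaded files visible from the current folder?
files <- c("Ch4.R", "Ch4data.csv")
found <- file.exists(files)
names(found) <- files
print(found)
```

If either entry is FALSE, you can move the file into the folder printed by getwd(), or use one of the alternative methods described on this page.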

How to run the files from this website in R

After you have downloaded both the R script and the dataset to your home directory, you can open the R program on your computer. At the ">" prompt, type in: source("") and put the name of the R script you want to run between the quotation marks. For example, you can enter a line that looks like this: > source("Ch4.R"). When you hit the return key, R will run the R script and give you all of the results as output.

Alternative methods for running R scripts

If you simply click on the links with the R scripts, you can then copy and paste all of the code into the R window and R will run the commands and give you the same results, but this will only work if the .csv file is in your home directory. Another option is to download both the R script and the .csv file to any location on your computer and provide the path to the R script when you use the source command. If you use this option, it is necessary to put both items in the same place so that the R script can find the .csv file. For example, you can enter a line that looks like this: > source("/Users/janedoe/Downloads/Ch4.R") for Mac users or > source("C:/Documents/Ch4.R") for PC users. If you do not know the path, you can drag and drop the file into an open R window, placing it after the cursor prompt ">". When you do this, R will tell you what the path to the file is and you can copy and paste that into the source command.
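One way to make the second option more robust is sketched here. It assumes that Ch4.R reads the dataset by its bare file name, which we have not verified for you here. The folder name is the hypothetical one from the example above. The chdir argument of source() tells R to work inside the script's folder while the script runs.

```r
# Hypothetical folder; replace it with the place where you saved both files.
script_dir <- "/Users/janedoe/Downloads"
script_path <- file.path(script_dir, "Ch4.R")

if (file.exists(script_path)) {
  # chdir = TRUE makes R work inside script_dir while the script runs,
  # so a bare "Ch4data.csv" inside the script is looked up there.
  source(script_path, chdir = TRUE)
} else {
  message("Ch4.R was not found at: ", script_path)
}
```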

The rest of this webpage will describe the database and give some guidance on how to interpret the results of the analysis that you will get from running the R script.

The database

There are 1920 lines of data, each corresponding to one of the examples extracted from the Russian National Corpus. This file does not contain the actual examples, but rather just the relevant data on each variable for each example. If you open the .csv file, you will see that there are four columns, corresponding to our four variables:

CONSTRUCTION: This is our dependent variable, and it has two values, theme and goal.

VERB: This is an independent variable, and it has four values, _zero (for the unprefixed verb грузить/gruzit’ ‘load’), na, za, and po (for the three prefixed variants).

REDUCED: This is an independent variable, and it has two values, yes and no. This refers to whether the construction was reduced (yes) or full (no).

PARTICIPLE: This is an independent variable, and it has two values, yes and no. This refers to whether the construction was passive (yes) or active (no).
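If you prefer to inspect the dataset in R rather than in Excel, something like the following sketch should work. It assumes that the file is comma-separated, has a header row with the four column names above, and sits in the current working directory. The helper has_expected_columns is hypothetical and only there for illustration. The name loaddata matches the one used in the model formula in the R output.

```r
# Column names described above.
expected_columns <- c("CONSTRUCTION", "VERB", "REDUCED", "PARTICIPLE")

# Hypothetical helper: TRUE if a data frame has exactly these four columns.
has_expected_columns <- function(df) {
  setequal(names(df), expected_columns)
}

data_file <- "Ch4data.csv"  # assumed to be in the current working directory
if (file.exists(data_file)) {
  # Assumes a comma-separated file with a header row.
  loaddata <- read.csv(data_file, stringsAsFactors = TRUE)
  print(has_expected_columns(loaddata))
  print(nrow(loaddata))         # the text above reports 1920 lines
  print(levels(loaddata$VERB))  # the four values of VERB
}
```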

The results of the analysis

The first thing the R script does is to give you a summary of the dataset. Scroll up to the top of the results and you will find a table that looks like this:

CONSTRUCTION   VERB        REDUCED     PARTICIPLE
goal : 871     _zero:393   no :1353    no : 895
theme:1049     na   :368   yes: 567    yes:1025
               po   :703
               za   :456

This table tells you how many items of each type are in each column of the dataset.
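As a quick sanity check, the counts in this table can be typed into R by hand to confirm that every column adds up to the same total, and to see the split between the two constructions as percentages. This is only a sketch that uses the numbers printed above; it does not read the dataset.

```r
# Counts copied from the summary table above.
counts <- list(
  CONSTRUCTION = c(goal = 871, theme = 1049),
  VERB         = c(`_zero` = 393, na = 368, po = 703, za = 456),
  REDUCED      = c(no = 1353, yes = 567),
  PARTICIPLE   = c(no = 895, yes = 1025)
)

# Every column should sum to the number of lines in the dataset.
totals <- sapply(counts, sum)
print(totals)

# Percentage of each value within its column, rounded to one decimal.
shares <- lapply(counts, function(x) round(100 * x / sum(x), 1))
print(shares$CONSTRUCTION)
```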

Next comes the logistic regression analysis, which you find under the heading “Logistic Regression Model” in the R output. We used a procedure (following Baayen 2008 and Gries 2009) for discovering the minimal adequate model for our data. This means that we started with a hypothetical model in which all independent variables serve as main effects and also interact with each other, and then we progressively stripped away the effects that were not significant until we arrived at a model that represented only significant relationships. We will not walk you through this whole procedure, but just show you the optimal model. This model has all of the independent variables as main effects, plus an interaction between the VERB and PARTICIPLE variables. The formula for this model is represented this way in your R output (and in the R script):

lrm(formula = CONSTRUCTION ~ VERB + REDUCED + PARTICIPLE + VERB:PARTICIPLE, data = loaddata, x = T, y = T, linear.predictors = T)

This can be stated in prose thus: “CONSTRUCTION varies according to VERB, REDUCED, and PARTICIPLE as main effects, and an interaction between VERB and PARTICIPLE.”
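If the formula notation is new to you, you can ask R to list the terms it contains without fitting anything. The sketch below uses only base R. The second formula is our own illustration of R's "*" shorthand and does not appear in the R script.

```r
# The model formula from the R output above.
model_formula <- CONSTRUCTION ~ VERB + REDUCED + PARTICIPLE + VERB:PARTICIPLE

# The terms on the right-hand side: three main effects and one interaction.
model_terms <- attr(terms(model_formula), "term.labels")
print(model_terms)

# "VERB * PARTICIPLE" is R shorthand for both main effects plus their
# interaction, so this formula contains the same set of terms.
shorthand <- CONSTRUCTION ~ VERB * PARTICIPLE + REDUCED
shorthand_terms <- attr(terms(shorthand), "term.labels")
print(setequal(model_terms, shorthand_terms))
```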

Next comes a little table telling you the overall number of items for each value for the dependent variable CONSTRUCTION: goal has 871, and theme has 1049.

Next come some figures that indicate how well the model performs. We will interpret only some of them. For more discussion, see Cohen et al. 2003.

Obs

1920

This is the number of observations = lines in the dataset.

Model L.R.

1738.47

This is the LL-ratio χ2, the difference between the two deviance values, with and without predictors. In other words, this is a test comparing how our model performs compared to a default model without any predictors at all. Our predictors do a good job, and we get a high value here.

d.f.

8

This tells us that there are eight degrees of freedom in our model.
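The figure of eight can be reconstructed by counting coefficients. A variable with k values contributes k - 1 coefficients, so we get 3 for VERB, 1 for REDUCED, 1 for PARTICIPLE, and 3 x 1 = 3 for the VERB:PARTICIPLE interaction. The sketch below lets R do the counting on a small grid of all value combinations. It assumes R's default (treatment) coding of factors and does not use the dataset.

```r
# One row for every combination of the variable values described above.
design <- expand.grid(
  VERB       = c("_zero", "na", "po", "za"),
  REDUCED    = c("no", "yes"),
  PARTICIPLE = c("no", "yes"),
  stringsAsFactors = TRUE
)

# Columns of the design matrix = intercept + one column per coefficient.
X <- model.matrix(~ VERB + REDUCED + PARTICIPLE + VERB:PARTICIPLE, data = design)

# Degrees of freedom of the model = number of coefficients besides the intercept.
n_coefficients <- ncol(X) - 1
print(n_coefficients)
```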

P

0

This tells us that the overall p-value for our model is displayed as zero, which means that it is too small to be shown at the precision R prints, not that it is literally zero. The p-value is the probability that we would find a sample with this strong a deviation from a random pattern, or an even stronger one, if there were no pattern at all in a potentially infinite population of examples of ‘load’ verbs in Russian.
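To get a feel for how small this p-value is, you can recompute it from the two figures reported above, the Model L.R. of 1738.47 and the 8 degrees of freedom, using the χ2 distribution in base R. This is a sketch of the arithmetic only. We work on the logarithmic scale because the ordinary result is expected to be too small for R to represent.

```r
# Values copied from the model output above.
lr_chisq <- 1738.47
df <- 8

# Upper-tail probability of the chi-squared distribution.
p_plain <- pchisq(lr_chisq, df = df, lower.tail = FALSE)
print(p_plain)

# The same probability as a power of ten, computed on the log scale.
log10_p <- pchisq(lr_chisq, df = df, lower.tail = FALSE, log.p = TRUE) / log(10)
print(log10_p)
```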

C

0.964

This is the coefficient of concordance, which according to Gries (2009) should ideally be 0.8 or higher. The maximum here is 1.0 so this is a high value.

Dxy

0.928

This is Somers’ Dxy, the rank correlation between predicted and observed responses. The maximum here is 1.0 so this is a high value.
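C and Dxy are two views of the same quantity: for a binary outcome, Dxy = 2 x (C - 0.5), which is why 0.964 and 0.928 go together. The sketch below checks this with the reported values and then shows, on a handful of made-up numbers, how C is obtained as the share of (positive, negative) pairs in which the positive case receives the higher prediction. The helper concordance_index and the toy numbers are our own illustration and are not taken from the ‘load’ dataset or from Ch4.R.

```r
# Reported value from the model output above.
C_reported <- 0.964
Dxy_from_C <- 2 * (C_reported - 0.5)
print(Dxy_from_C)

# Hypothetical helper: share of (positive, negative) pairs in which the
# positive case has the higher predicted value; ties count as one half.
concordance_index <- function(pred, obs) {
  pos <- pred[obs == 1]
  neg <- pred[obs == 0]
  diffs <- outer(pos, neg, "-")
  (sum(diffs > 0) + 0.5 * sum(diffs == 0)) / length(diffs)
}

# Made-up toy numbers, NOT taken from the 'load' dataset.
toy_pred <- c(0.9, 0.8, 0.3, 0.2)
toy_obs  <- c(1, 1, 0, 1)
toy_C <- concordance_index(toy_pred, toy_obs)
toy_Dxy <- 2 * (toy_C - 0.5)
print(c(C = toy_C, Dxy = toy_Dxy))
```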

R2

0.796

This is Nagelkerke’s R2, which tells us the correlational strength in terms of the amount of variance that is accounted for by the model. The maximum here is 1.0 so this is a high value.

Next comes a table that has these headings: Coef, S.E., Wald Z, P. The rightmost column lists the p-value for each predictor variable. Most of these are displayed as zero, which indicates that they are highly significant. This does not give us a lot of detail, however, which is why we also fit the logistic regression in a second way below, which gives us some additional information.

Next comes a table with the headings Factor, Chi.Square, d.f., P. This is an ANOVA analysis comparing how the various factors perform. We see that VERB is strongest, next comes PARTICIPLE, next is VERB*PARTICIPLE (the interaction), and next is REDUCED.

Next comes an alternative way of calculating the logistic regression model, by using the binomial version of the generalized linear model (glm). This gives many of the same values, but also gives us some different insights into the data. Under Coefficients, in the column for Pr(>|z|), we get a p-value for the coefficient of each value of each variable.

Under "These are the confidence interval values:", we get a 95% confidence interval for the coefficient of each variable value. Note that none of these confidence intervals spans 0.0, which indicates that each of them is a significant predictor at the 5% level.

Under "These are the odds of success for each predictor variable:", we get the odds that each predictor value has of predicting a correct outcome in our model. This is another way of ranking the predictor variables.
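For readers who want to see roughly how such output can be produced, here is a minimal sketch that uses only base R. It is not the code from Ch4.R. The helper summarise_load_glm is hypothetical, and the choice of Wald intervals via confint.default() is our assumption; the script may compute its intervals differently. The sketch also assumes that the odds reported in the output are the exponentiated coefficients.

```r
# Hypothetical helper (not taken from Ch4.R).
summarise_load_glm <- function(loaddata) {
  # glm() treats the first factor level as "failure"; with the levels in
  # alphabetical order, the model describes the log-odds of "theme".
  loaddata$CONSTRUCTION <- factor(loaddata$CONSTRUCTION)
  fit <- glm(CONSTRUCTION ~ VERB + REDUCED + PARTICIPLE + VERB:PARTICIPLE,
             data = loaddata, family = binomial)
  # Wald 95% intervals on the log-odds scale (an assumption, see above).
  ci <- confint.default(fit, level = 0.95)
  list(
    fit           = fit,
    conf_int      = ci,
    excludes_zero = ci[, 1] > 0 | ci[, 2] < 0,  # TRUE if the interval misses 0
    odds_ratios   = exp(coef(fit))              # coefficients as odds
  )
}

data_file <- "Ch4data.csv"  # assumed to be in the current working directory
if (file.exists(data_file)) {
  result <- summarise_load_glm(read.csv(data_file, stringsAsFactors = TRUE))
  print(summary(result$fit)$coefficients)
  print(result$conf_int)
  print(result$odds_ratios)
}
```

An exponentiated coefficient above 1 means that the value in question raises the odds of the second level of CONSTRUCTION relative to the reference value, and a figure below 1 means that it lowers them.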
