Folks,

In this blog post we will explore the different uses of the **MEANS** procedure in SAS.

Descriptive statistics such as the *sum, average, minimum, maximum, range, and standard deviation* provide useful information about numeric data (numeric variables).

The SAS **MEANS** procedure also provides helpful options for controlling your output.

**Means Procedure Basics:**

The SAS **MEANS** procedure is a way to generate summary reports of descriptive statistics, such as the sum, min, max, and mean of your numeric variables. However, the MEANS procedure is much more versatile: it can also create summary output data sets, which can be used in other DATA or PROC steps in SAS.

Procedure Syntax-

proc means <data=SAS-data-set> <statistic-keyword(s)> <option(s)>; var variable(s); by variable(s); class variable(s) </ option(s)>; id variable(s); output <output-specification(s)>;

*Where*

- **SAS-data-set** is the name of the SAS data set to be used by the MEANS procedure.
- **Statistic-keyword(s)** specify the statistics to compute, e.g. min, max, mean, sum.
- **Option(s)** control the content, analysis, and appearance of the output.
- **VAR** identifies the analysis variables and their order in the results.
- **BY** calculates separate statistics for each BY group.
- **CLASS** identifies variables whose values define subgroups for the analysis.
- **ID** includes additional identification variables in the output data set.
- **OUTPUT** creates an output data set that contains the specified statistics and identification variables.

**Identify Missing Values**:

Suppose we have a SAS data set **STUDENTS** with 14 observations, 3 numeric variables, and 2 character variables. (*See the CONTENTS procedure output below.*)

We submitted the MEANS procedure code below for the STUDENTS data set without providing any statistic keywords or options.

proc means data=work.students; run;

Here is the Output of Means Procedure.

By default, as shown above, the **MEANS** procedure produces **N** (*the number of non-missing observations*), **Mean**, **Standard Deviation**, **Minimum**, and **Maximum** for all numeric variables in the input SAS data set.

Using **options** we can request additional statistics. Remember that once you list statistic keywords, only the listed statistics are printed, so you must include any of the default statistics you still need. Also, only numeric variables can be listed in the VAR statement.

As already mentioned, the **STUDENTS** data set has 14 observations. The variable **STUDENT_AGE** therefore has some **missing** values, because its count (N) is less than 14.
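To count the missing values directly rather than inferring them from N, a minimal sketch might add the NMISS statistic keyword. The data set work.students_demo and its values are made up for illustration; only the variable names follow this post.

```sas
/* Hypothetical demo data: variable names follow the post, values are invented */
data work.students_demo;
    input STUDENT_GENDER $ STUDENT_AGE STUDENT_WEIGHT;
    datalines;
M 21 70
F .  55
F 23 58
M .  82
;
run;

/* N counts non-missing values, NMISS counts missing values */
proc means data=work.students_demo n nmiss;
    var STUDENT_AGE STUDENT_WEIGHT;
run;
```

With this demo data, STUDENT_AGE should report N = 2 and NMISS = 2, while STUDENT_WEIGHT reports N = 4 and NMISS = 0.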

**Validate Numeric Data Range**:

The MEANS procedure can also be used to validate numeric data, because its summary report displays descriptive statistics such as the minimum, maximum, and standard deviation.

It can show whether the values of a particular numeric variable fall within their expected range.

Example:

proc means data=work.students; run;

The output of the **MEANS** procedure displays a range of 27 to 399 for the **STUDENT_AGE** variable, which clearly shows that there is invalid data somewhere in the **STUDENT_AGE** column, so data cleaning is required.
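Once the summary flags a suspicious maximum, one way to locate the offending rows is a WHERE statement in PROC PRINT. The sketch below is only an illustration: the data set work.age_check, its values, and the cut-off of 120 are assumptions, not part of the original STUDENTS data.

```sas
/* Hypothetical demo data; 399 mimics the out-of-range age seen in the report */
data work.age_check;
    input STUDENT_GENDER $ STUDENT_AGE;
    datalines;
M 27
F 399
F 31
M .
;
run;

/* Assumed rule for illustration: ages above 120 are invalid.
   Missing ages do not satisfy the condition, so they are not listed. */
proc print data=work.age_check;
    where STUDENT_AGE > 120;
run;
```

Only the observation with an age of 399 is printed, which gives you the row to correct.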

**Additional Statistics & VAR Statement:**

As already mentioned, PROC MEANS prints the **n-count** (number of non-missing values), **mean**, **standard deviation**, **minimum**, and **maximum** of every **numeric variable** in an input SAS data set.

We can control which variables to include in the report by supplying a **VAR** statement.

By adding statistic keywords as **options** in the PROC MEANS statement, we can also request additional statistics. Remember that once you list statistic keywords, only the listed statistics are printed, so you must include any of the default statistics you still need.

Again, only numeric variables can be listed in the VAR statement.

**Example**: Here we request the **n std range skewness kurtosis** statistics in the MEANS procedure output for only the **STUDENT_AGE** variable in the **STUDENTS** SAS data set.

proc means data=work.students n std range skewness kurtosis; var student_age; run;

### Group Processing – CLASS Statement:

The CLASS statement is used to categorize data in the output. CLASS variables can be either character or numeric, but they should contain discrete values. If a CLASS statement is used, the **N Obs** statistic is also reported, which is the number of observations in each group defined by the CLASS variables.

**CLASS** variable(s);

where variable(s) specifies category variables for group processing.

proc means data=work.students n max min std range q1 q3 qrange; var STUDENT_WEIGHT STUDENT_HIEGHT; class STUDENT_GENDER; run;

**Group Processing – BY Statement:**

Like the CLASS statement, the BY statement also specifies variables to use for categorizing observations.

**BY** variable(s);

where variable(s) specifies category variables for group processing.

Note: You have to first sort your data set by the variable or variables you list on the BY statement.

proc sort data=work.students; by STUDENT_GENDER; run; proc means data=work.students max min std range q1 q3 qrange; var STUDENT_WEIGHT STUDENT_HIEGHT; by STUDENT_GENDER; run;

You now have descriptive statistics for males and females separately, along with a separate group for the one observation whose STUDENT_GENDER value is missing.

**Difference between Class & By Statements:**

- The CLASS statement is easier to use than the BY statement, as it doesn’t require a sorting step. If you have a very large data set that is not sorted, you may want to use a CLASS statement. However, if the data set is already in the correct sorted order, a BY statement is more **efficient**.
- If you are using PROC MEANS to print a report and are not creating a summary output data set, the differences in the printed output between a BY and a CLASS statement are basically related to layout: a CLASS statement produces a single large table, while a BY statement produces a separate table for each group.

### Creating Summarized Data Set – PROC MEANS

We can use PROC MEANS to create a new data set that contains summary information such as sums and means. This data set can then be used for further analysis.

OUTPUT OUT= SAS-data-set statistic=variable(s);

where

- **OUT=** specifies the name of the output data set.
- **statistic=** specifies the summary statistic written out.
- **variable(s)** specifies the names of the variables to create. They represent the statistics for the analysis variables that are listed in the VAR statement.

**Example 1:** Without specifying any output variable names in the OUTPUT statement.

proc means data=work.students max min noprint; var STUDENT_WEIGHT STUDENT_HIEGHT; output out=my_summary; run; proc print; run;

*PROC MEANS produces a report by default; the NOPRINT option suppresses that default report.*

**Example 2:** Specifying variable names in the OUTPUT statement.

proc means data=work.students max min noprint; var STUDENT_WEIGHT STUDENT_HIEGHT; output out=my_summary max=MAX_STUDENT_WEIGHT MAX_STUDENT_HIEGHT min=MIN_STUDENT_WEIGHT MIN_STUDENT_HIEGHT; run; proc print; run;

**Example 3:** Using the **autoname** option to generate the variable names in the output data set.

proc means data=work.students n max min std noprint; var STUDENT_WEIGHT; output out=my_summary n= max= min= std= / autoname; run; proc print; run;

**Example 4:** Including a **BY** statement.

Use this when you would like to output summary statistics for each level of one or more classification variables.

Remember, you have to first sort your data set by the variable or variables you list on the BY statement.

proc means data=work.students max min noprint; var STUDENT_WEIGHT STUDENT_HIEGHT; by student_gender; output out=my_summary02 max=MAX_STUDENT_WEIGHT MAX_STUDENT_HIEGHT min=MIN_STUDENT_WEIGHT MIN_STUDENT_HIEGHT; run; proc print; run;

In this data set, \_FREQ\_ represents the number of observations for each value of gender.
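The sort-then-summarize pattern for a BY statement can be sketched end to end as follows. This is an illustration under stated assumptions: it uses the sashelp.class sample table shipped with SAS in place of the STUDENTS data set, and the output names work.class_sorted and work.class_by_summary are made up.

```sas
/* Sort into a copy so the shipped sample table is left untouched */
proc sort data=sashelp.class out=work.class_sorted;
    by sex;
run;

/* One output observation per BY group, with named max and min columns */
proc means data=work.class_sorted max min noprint;
    var height weight;
    by sex;
    output out=work.class_by_summary
        max=max_height max_weight
        min=min_height min_weight;
run;

proc print data=work.class_by_summary;
run;
```

Writing the sorted data to a separate data set with OUT= is a design choice; sorting in place works equally well when you are free to reorder the input.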

**Example 5:** Including a **CLASS** statement.

proc means data=work.students max min noprint; var STUDENT_WEIGHT STUDENT_HIEGHT; class student_gender; output out=my_summary03 max=MAX_STUDENT_WEIGHT MAX_STUDENT_HIEGHT min=MIN_STUDENT_WEIGHT MIN_STUDENT_HIEGHT; run; proc print; run;

In the above output, the first observation in this data set, where \_TYPE\_ equals 0, holds the statistics for males and females combined (the overall row; with the MEAN statistic this would be the grand mean), and the observations where \_TYPE\_ equals 1 hold the statistics for females and males separately.

If you do not want the overall observation (\_TYPE\_ equal to 0 in the above output) in your output data set, use the NWAY option of PROC MEANS.

__You can add multiple class variables in the class statement. Adding two classification variables to the CLASS statement enables you to group your analysis into multiple levels.__
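As a rough sketch of both ideas together, the step below combines two CLASS variables with the NWAY option. It assumes the sashelp.class sample table shipped with SAS as a stand-in for the STUDENTS data set, and the output data set name work.class_nway is hypothetical.

```sas
/* sashelp.class stands in for the post's data; sex and age act as class variables */
proc means data=sashelp.class max min noprint nway;
    class sex age;
    var height weight;
    output out=work.class_nway
        max=max_height max_weight
        min=min_height min_weight;
run;

/* With NWAY only the highest _TYPE_ (3 for two class variables) is kept */
proc print data=work.class_nway;
run;
```

Without NWAY the same step would also write the overall row (\_TYPE\_ 0) and the one-variable summaries (\_TYPE\_ 1 and 2).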

Thanks!

Happy Learning! Your feedback would be appreciated!
