Research and Studies
Research is an integral part of the NCA’s professional work. Once a test is developed and administered according to standard procedures, it is immediately followed by several forms of analysis at the item level as well as at the test level. Psychometric characteristics are gathered and studied in order to understand the performance of examinees in general and/or according to their academic or demographic classifications. Evidence-based recommendations are then communicated to the concerned bodies, whether within the Center or elsewhere. Based on research findings, the Center may make the modifications needed to improve its products. Most of this research is action research. Some research aims to contribute to the accumulation of knowledge, and its results may be of interest to special groups as well as to society at large. Likewise, other research aims to solve particular problems in the practical field. Here we will post peer-reviewed research conducted by NCA researchers and consultants in both Arabic and English. We hope it will be of theoretical or practical value to all visitors to this website.
  
 
​​
 
Title report
 
Summary
 
The Role of the Department of Study and Major on Teachers’ Knowledge and Attitudes
Georgios Sideridis
 
 
The purpose of the present study was to evaluate differences due to specialty in teachers’ attitudes, knowledge and skills. Participants were 44,692 teachers from various specialties who completed a self-report measure of attitudes and knowledge. The effects of age, type of major and experience were evaluated using hierarchical linear modeling techniques. Results indicated that there was a noticeable trend for science majors to have more positive attitudes compared to Art/Education majors. Furthermore, although age had no overall effect, increases in age were associated with more positive attitudes for science majors only. The findings have implications for the incentives given to various age and gender groups in the education profession, as the trajectories of change in attitudes over time are not necessarily positive across disciplines.
Identifying Content and Cognitive Dimensions of the Standardized Test of English Proficiency (STEP)
 
Georgios Sideridis
 
 
 
The purpose of the present literature review is to provide information on the theoretical background behind the development of the STEP language instrument. According to the developers of the STEP, the instrument has been heavily influenced by two contemporary theoretical frameworks: one originating in the United States, from the American Council on the Teaching of Foreign Languages (ACTFL), and one originating in Europe, the Common European Framework of Reference for Languages (CEFR). The review gives a brief description of the two schemes, with an emphasis on the European system (i.e., CEFR), as it has been more influential in the development of the STEP instrument. The description of the two language guideline systems is followed by a critical analysis of both, which ends with a set of recommendations and future directions.
An Examination of the Impact of Examination Center Size and Gender Balance on Achievement Scores on the GAT
Georgios D. Sideridis
 
 
 
The primary purpose of the present study was to evaluate the likelihood that examination centers were associated with differential performance on the GAT. A secondary purpose was to test various center characteristics as predictors of the initially observed variability in performance across centers. Participants were 319,218 examinees who took the GAT examination. A multilevel modeling approach was implemented in which examinees were nested within centers and their achievement on the GAT was predicted by the center-level predictors of center size and gender distribution within each center. Results indicated that there was significant variability in the performance of the centers. This variability was significantly predicted by center size, in that larger examination centers were associated with significantly elevated performance levels on the GAT. There was also a significant main effect for gender, with males having significantly higher performance on the GAT compared to females. However, this effect was not significant after accounting for examination center size. It is concluded that examination centers contain a significant amount of information related to examinees’ performance on the GAT and should be investigated further.
 
A Perspective on the State of Secondary Education and Its Development
 
Dr. Faisal bin Abdullah Al-Mishari Al Saud
 
 
 
 
 
 
 
Education in general, and secondary education in particular, has been the subject of many attempts at change and development, and at the same time of much criticism and disparagement. Perhaps these criticisms were the fuel that drove the attempts at change and the search for solutions.
In my view, many features of these changes were borrowed from foreign education systems in an attempt to benefit from their sometimes apparent advantages, and because we believed that they would not exist unless they were better than what we have, or because others preceded us in this field.
The borrowing of foreign systems and their application to the Saudi environment has often been accompanied by neglect of factors that are important to the success of this borrowing, including adapting these systems to the local environment and attending to their essence rather than their outward appearance. We also often overlook the conditions for applying them, which are necessarily the conditions for their success. And we often err when we adopt a system before investigating the available alternatives and choosing the most suitable among them.
 
Content Analysis of a Sample of Articles Written about the National Center for Assessment in Higher Education in Saudi Newspapers
 
Dr. Dakhil Al-Dakhilallah
 
 
 
The present study aimed to identify the manifest and latent content of a sample of articles published in the major Saudi newspapers about the National Center for Assessment ("Qiyas"), including the topics, aims, and reasons for writing about the "Qiyas" center. It also sought to identify the writing style, the writer's linguistic style, and the characteristics of the context in which the articles appeared. A total of 84 articles were analyzed; they were published in 9 Saudi newspapers distributed in varying proportions across the main administrative regions of Saudi Arabia (Central, Western, Eastern, and Southern). The researcher used content analysis, in both its manifest and latent forms, to answer the study questions and achieve its aims. The study found that the authors of the sampled articles comprised 68 males and 16 females. The educational levels of some of the authors ranged from a university degree at the lowest to a doctorate at the highest; 23 of them held doctorates. No data were available on the educational levels of the remaining authors. The authors' relationship to the article topics, and the relevance of their professional backgrounds to those topics, varied, although they tended to be distant from the subject. The content analysis showed that the aim of most articles was to make claims, with weak evidence supporting those claims. Some articles aimed to discuss a measurement topic, but discussion was not the primary aim of most articles. Analysis of the manifest content revealed a notably low level of knowledge about measurement and the National Center for Assessment. Vulgarity in language was low in many articles but was evident in a few others. The analysis also revealed that authors rarely acknowledged other points of view on their topic. The latent content analysis, for its part, revealed a scarcity of meaningful and significant ideas, alongside a reasonable level of seriousness, and showed that some authors committed a number of logical fallacies. The latent analysis also showed that negativity dominated the content and nature of the context of the sampled articles. This finding is consistent with the analysis of article titles, which showed that the authors' negative attitudes toward their topics outweighed their positive ones.
The researcher concluded from the manifest and latent content analyses that the results at each level of analysis were similar in nature. This convergence suggests a stance toward measurement and the National Center for Assessment that appears mostly negative. The researcher appended to the study a list of theoretical and applied conclusions based primarily on the study's findings, each followed by a recommendation derived from it. The conclusions and recommendations centered on the need to enforce the rules and controls of press publishing and to develop newspaper writers' basic skills in writing, analysis, and objective criticism, as well as on training in the etiquette of dialogue and disagreement and the art of dealing with people.
 
 
Test length and precision: how precise the GAT will be if the test is shortened
 
Dr. Abdullah Al Qataee
Dr. Abdulrahman Al-Shamrani
 
 
 
 
 
All the evidence gathered indicates that short-form tests are a promising investment. Reliability evidence, standard error of measurement, and correlations with other forms are all supportive of the use of the short form.
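The link between test length and precision can be sketched with two classical formulas. This is a minimal illustration, assuming the Spearman-Brown prophecy formula and the classical standard error of measurement; the summary does not say which methods the study actually used, and all numbers below are hypothetical, not GAT results.

```python
import math

def spearman_brown(reliability: float, length_ratio: float) -> float:
    """Predicted reliability when test length is multiplied by length_ratio."""
    return (length_ratio * reliability) / (1.0 + (length_ratio - 1.0) * reliability)

def standard_error_of_measurement(sd: float, reliability: float) -> float:
    """Classical SEM: score standard deviation times sqrt(1 - reliability)."""
    return sd * math.sqrt(1.0 - reliability)

# Hypothetical values (assumptions for illustration only).
full_reliability = 0.90
short_reliability = spearman_brown(full_reliability, length_ratio=0.5)  # half-length form
sem_example = standard_error_of_measurement(sd=10.0, reliability=0.75)

print(round(short_reliability, 3), sem_example)
```

Under these assumed values, halving the test lowers the predicted reliability only moderately, which is the kind of trade-off a short-form study weighs against testing time.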
 
 
Standardized Test of Arabic Proficiency in Speakers of other Languages (STAPSOL) Evidence of its reliability and validity
 
Dr. Amjed Al-Owidha
Dr. Abdulrahman Al-Shamrani
 
 
 
In response to requests from some academic institutions, the National Center for Assessment in Higher Education has developed a Standardized Test of Arabic Proficiency in Speakers of other Languages (STAPSOL). This test follows a strict methodology in the way its items are written, reviewed, trialed, maintained and secured.
The test has four components: Reading Comprehension (RC), Structure (ST), and Listening Comprehension (LC), as well as non-scorable trial items. A free writing task is also included.
In order to meet internationally recognized standards, this test is designed to ensure reliability, validity and fairness with respect to sex, region and level of study. To achieve maximum quality assurance, the test items are written and reviewed by native speakers who are specialists in the fields of Arabic, applied linguistics and measurement. Its listening section is recorded exclusively by native speakers of Arabic. STAPSOL is also in the process of being formally linked to the Common European Framework of Reference for Languages (CEFR).
A sample of 1,621 test takers enrolled in different institutions in Jakarta, Indonesia, took the test. The findings indicated that the test is reliable. Initial construct validity evidence supports the validity of the test. However, more studies of the test’s validity are needed.
 
DIF Analysis for Item and Test on the NCA Tests
The General Ability Test (GAT)
Art Major
 
Georgios Sideridis
Ioannis Tsaousis
 
 
 
The aim of this research project is to investigate the presence of bias in the General Aptitude Test (GAT) subscales. Evidence for bias will be provided by use of the Differential Item Functioning (DIF) and Differential Test Functioning (DTF) procedures in Item Response Theory (IRT) models, along with the procedure of measurement invariance using the multi-group Confirmatory Factor Analysis (MGCFA) technique. Three forms of bias were assessed at the item level: (a) uniform DIF, (b) non-uniform DIF, and (c) item-latent factor invariance. Item bias involves the presence of differential item functioning on item difficulties and/or item discriminations across gender, school type, provinces, regions, and test forms. Furthermore, bias was assessed at the test level. For this purpose, Differential Test Functioning (DTF) will be implemented. DTF is a procedure analogous to DIF, but instead of examining between-group differences in Item Characteristic Curves (ICCs), it does so at the test level by examining the Test Characteristic Curves (TCCs).
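The distinction between uniform DIF, non-uniform DIF, and DTF can be illustrated with a small sketch. It assumes a two-parameter logistic (2PL) IRT model and made-up item parameters for a reference and a focal group; the report's actual models, parameters, and DIF statistics are not given here, so every value below is hypothetical.

```python
import math

def icc(theta, a, b):
    """2PL item characteristic curve: P(correct | theta)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def tcc(theta, items):
    """Test characteristic curve: expected number-right score at theta."""
    return sum(icc(theta, a, b) for a, b in items)

# Hypothetical (discrimination a, difficulty b) pairs per group.
reference = [(1.0, 0.0), (1.2, 0.5), (0.8, -0.5)]
focal = [(1.0, 0.3), (1.2, 0.5), (0.8, -0.5)]  # first item is harder for the focal group

grid = [x / 10.0 for x in range(-30, 31)]  # ability values from -3 to 3

# Uniform DIF: same discrimination, shifted difficulty, so the gap keeps one sign.
item_gap = [icc(t, *reference[0]) - icc(t, *focal[0]) for t in grid]

# Non-uniform DIF: different discriminations, so the gap changes sign along theta.
nonuniform_gap = [icc(t, 1.0, 0.0) - icc(t, 1.6, 0.0) for t in grid]

# DTF: the same comparison made on the test characteristic curves.
test_gap = [tcc(t, reference) - tcc(t, focal) for t in grid]

print(max(item_gap), max(test_gap))
```

Because only one item differs between the two hypothetical groups, the test-level gap here equals the item-level gap; with several items showing DIF in opposite directions, the TCC difference could partly cancel.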
 
DIF Analysis for Item and Test on the NCA Tests
The General Ability Test (GAT)
Science Major
 
Georgios Sideridis
Ioannis Tsaousis
 
 
 
 
The aim of this research project is to investigate the presence of bias in the General Aptitude Test (GAT) subscales. Evidence for bias will be provided by use of the Differential Item Functioning (DIF) and Differential Test Functioning (DTF) procedures in Item Response Theory (IRT) models, along with the procedure of measurement invariance using the multi-group Confirmatory Factor Analysis (MGCFA) technique. Three forms of bias were assessed at the item level: (a) uniform DIF, (b) non-uniform DIF, and (c) item-latent factor invariance. Item bias involves the presence of differential item functioning on item difficulties and/or item discriminations across gender, school type, regions, and test forms. Furthermore, bias was assessed at the test level. For this purpose, Differential Test Functioning (DTF) will be implemented. DTF is a procedure analogous to DIF, but instead of examining between-group differences in Item Characteristic Curves (ICCs), it does so at the test level by examining the Test Characteristic Curves (TCCs).
 
GAT-Verbal: Testing for Dimensionality and Validation of Factorial Structure
 
Dimiter M. Dimitrov
 
 
 
This study examines the dimensionality and factorial structure of data on the General Aptitude Test-Verbal Part (GAT-Verbal) administered by the National Center for Assessment in Higher Education (NCA) at Riyadh, Saudi Arabia. The data consist of binary scores of 15,610 students on 65 multiple-choice items grouped into three content-specific domains: Analogy (21 items), Sentence Completion (17 items), and Reading Comprehension (27 items). Testing for dimensionality and factorial structure of GAT-Verbal data was conducted in the framework of confirmatory factor analysis (CFA). The best data fit was found for a bifactor model of the GAT-Verbal factor structure, with one general factor of verbal aptitude and the three content specific domains as latent aspects of the verbal aptitude. These results provide support to the essential unidimensionality of the test scores and the appropriateness of scoring the student responses on each of the three content-specific domains (Analogy, Sentence Completion, and Reading Comprehension). Five (out of 65) items were identified for misfit to the factorial structure of GAT-Verbal, namely: one item from Analogy, one item from Sentence Completion, and three items from Reading Comprehension. An important practical implication of these results is that it is appropriate to use unidimensional IRT calibration of GAT-Verbal and to scale students’ performance on the content-specific domains of the test.
 
Latent Class Analysis of GAT-Quantitative Data
 
Dimiter M. Dimitrov
 
 
 
This study deals with latent class analysis (LCA) of data on the General Aptitude Test-Quantitative Part (GAT-Quantitative) administered by the National Center for Assessment in Higher Education (NCA) in Riyadh, Saudi Arabia. The data consist of the binary scores of 15,610 students on 55 multiple-choice items grouped into five content-specific domains: Arithmetic (20 items), Geometry (12 items), Comparison (8 items), Analysis (10 items), and Algebra (5 items). The goal was to identify latent classes of students who took GAT-Quantitative based on their probabilities of answering the test items correctly. The results from the LCA, performed with the computer program Mplus (Muthén & Muthén, 2010), led to the identification of the following latent classes of students by content-specific domain: (a) six latent classes for Arithmetic, (b) four latent classes for Geometry, (c) five latent classes for Comparison, (d) three latent classes for Analysis, and (e) three latent classes for Algebra. Detailed examination of these results can be very helpful in understanding the characteristics of GAT test-takers grouped into homogeneous classes by item performance, and the differentiated properties of test items across different classes of examinees.
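How a fitted latent class model assigns an examinee to a class can be sketched as follows. This is a minimal illustration of the posterior class-membership calculation under the usual local-independence assumption; the class prevalences and item probabilities are hypothetical, and the sketch does not reproduce the Mplus estimation used in the study.

```python
def class_posteriors(responses, prevalences, item_probs):
    """Posterior P(class | responses), assuming items are independent within a class."""
    joint = []
    for prior, probs in zip(prevalences, item_probs):
        likelihood = prior
        for x, p in zip(responses, probs):
            # p is the class-specific probability of a correct answer to this item
            likelihood *= p if x == 1 else (1.0 - p)
        joint.append(likelihood)
    total = sum(joint)
    return [j / total for j in joint]

# Hypothetical two-class model with three binary items.
prevalences = [0.6, 0.4]
item_probs = [
    [0.9, 0.8, 0.7],  # higher-performing class
    [0.3, 0.2, 0.1],  # lower-performing class
]

post = class_posteriors([1, 1, 0], prevalences, item_probs)
print(post)
```

An examinee is typically assigned to the class with the largest posterior probability; the class-specific item probabilities are what characterize each class's item performance.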
 
Latent Class Analysis of GAT-Verbal Data
 
Dimiter M. Dimitrov
 
 
 
This study deals with latent class analysis (LCA) of data on the General Aptitude Test-Verbal Part (GAT-Verbal) administered by the National Center for Assessment in Higher Education (NCA) in Riyadh, Saudi Arabia. The data consist of the binary scores of 15,610 students on 65 multiple-choice items grouped into three content-specific domains: Analogy (21 items), Sentence Completion (18 items), and Reading Comprehension (26 items). The goal was to identify latent classes of students who took GAT-Verbal based on their probabilities of answering the test items correctly. The results from the LCA, performed with the computer program Mplus (Muthén & Muthén, 2010), led to the identification of the following latent classes of students by content-specific domain: (a) six latent classes for Analogy, (b) three latent classes for Sentence Completion, and (c) four latent classes for Reading Comprehension. Further examination of these results based on demographic information about students, schools, curricula, and regions can be very helpful in understanding the characteristics of GAT test-takers grouped into homogeneous classes by item performance, and the differentiated properties of test items across different classes of examinees.
 
IRT and True-Score Analysis of the NCA Tests for Teaching Skills
 
Dimiter M. Dimitrov
 
 
 
Item response theory (IRT) analysis of NCA test data is in line with the contemporary approach to obtaining “sample free” accurate estimates of both item parameters and ability scores of examinees. Compared to classical true-score theory (CTT), IRT provides higher accuracy and flexibility in evaluating a person’s ability and item characteristics. However, numerous studies support the argument that IRT and CTT may coexist in making adequate interpretations and decisions based on the test data (e.g., Abedlazeez, 2010; Dimitrov, 2003a; 2003b; Fan, 2010; Hambleton, Swaminathan, & Rogers, 1991). Traditionally, CTT measures represent a common focal point between test developers and practitioners, as they place the scores and their accuracy on the original scale of measurement, say, number-right (NR) scores in a multiple-choice item (MCI) test. For example, Hambleton, Swaminathan, and Rogers (1991, p. 85) noted that the transformation of the IRT ability score to the true (or domain) score in CTT has important implications: (a) negative scores on the IRT logit scale are eliminated, (b) the transformed scale ranges from 0% to 100% if the domain score is used, which is readily interpretable, and (c) when pass-fail decisions must be made, a cutting score is typically set on the domain-score scale. Thus, combining IRT information about person and item parameters with readily interpretable true-score information can positively affect the quality of test development, analysis, and decision making.
Combining IRT and CTT information in the psychometric analysis of test data requires a better understanding of the relationships between IRT concepts and their true-score counterparts, from both technical and methodological perspectives. This report illustrates the approach through an IRT and CTT analysis of data from the NCA test on Teaching Skills.
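The transformation from an IRT ability score to a true (domain) score described above can be sketched as follows. This is a minimal illustration assuming a two-parameter logistic model with hypothetical item parameters; the helper names are invented for the example and the report's own calibration is not reproduced.

```python
import math

def p_correct(theta, a, b):
    """2PL probability of a correct response at ability theta."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def true_score(theta, items):
    """Expected number-right score: sum of item probabilities at theta."""
    return sum(p_correct(theta, a, b) for a, b in items)

def domain_score(theta, items):
    """True score expressed as a percentage of the number of items (0 to 100)."""
    return 100.0 * true_score(theta, items) / len(items)

def theta_at_cut(cut_percent, items, lo=-6.0, hi=6.0, tol=1e-8):
    """Bisection search for the ability whose domain score equals cut_percent."""
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if domain_score(mid, items) < cut_percent:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

# Hypothetical (discrimination, difficulty) pairs, not NCA estimates.
items = [(1.0, -1.0), (1.0, 0.0), (1.0, 1.0)]

low_ability_percent = domain_score(-2.0, items)  # a negative logit maps to a positive percentage
cut_theta = theta_at_cut(50.0, items)            # ability matching a 50% cutting score

print(low_ability_percent, cut_theta)
```

The sketch shows points (a) to (c): a negative ability maps to a percentage between 0 and 100, and a cutting score set on the domain-score scale can be mapped back to the ability scale because the domain score increases monotonically with ability when discriminations are positive.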
 
Testing for Unidimensionality of GAT Data
 
Dimiter M. Dimitrov
 
 
 
Testing for (uni)dimensionality of test data is of critical importance for the selection of an appropriate model for data analysis and validation in any measurement framework: classical true-score theory (CTT), item response theory (IRT), or confirmatory factor analysis (CFA). Without going into a review of existing approaches to testing for dimensionality of data, the approach used for this task is bifactor modeling in the framework of CFA, applied to investigate the dimensionality of data collected through the administration of the GAT assessment at the NCA. The rationale for this choice is based on arguments in the research on dimensionality that the bifactor model (a) provides an evaluation of the distortion that may occur when unidimensional models are fit to multidimensional data, (b) allows researchers to evaluate the utility of forming subscales, and (c) provides an alternative to nonhierarchical multidimensional models for scaling individual differences (e.g., Chen, West, & Sousa, 2006; Reise, Morizot, & Hays, 2007).
The main question to be addressed with a bifactor model is whether the data are sufficiently unidimensional to apply a unidimensional IRT model without significant distortion in item parameters, or whether a multidimensional item response theory (MIRT) model is more appropriate. Also, if a unidimensional model is appropriate, are the domain-specific clusters of items substantial and reliable enough to allow for scoring the examinees on such domains? These questions are particularly relevant in the context of NCA tests, as they typically assume one general dimension (factor) that underlies the examinees’ performance on the test and several domain-specific subdomains.
 
 
Investigation for Potentially Exposed Items in the Context of the Verbal section of the General Aptitude Test (GAT)
 
Iasonas Lamprianou
 
 
This study is an extension of an exploratory study which was conducted by the same author in September 2013 under the title “Investigation for Exposed Items in the Context of the General Aptitude Test (GAT)”. The original, exploratory study aimed to investigate whether a specific group of items of the quantitative section of the test may have been exposed to the candidates. However, the General Aptitude Test (GAT) consists of a verbal section as well.
This study replicates the models and procedures of the original study (a) to investigate whether a specific group of items of the verbal section of the test may have been exposed to the candidates; and (b) to identify the response patterns of examinees who may have benefitted from the allegedly “exposed” items on the verbal section of the test.
The results suggest that only a very small number of candidates may have benefitted from the allegedly “exposed” items. Unfortunately, demographic data are not available, so it is not possible to investigate whether those candidates belong to a specific social class or reside in specific areas.
 
 
The Relationship Between Test Takers’ Characteristics and Their Performance on the New Teachers' Licensure Test: The Case of Specialty 16 (Special Education)
 
Iasonas Lamprianou
 
 
 
The New Teachers' Licensure Test is used by the National Center for Assessment in Higher Education as a licensure test for individuals aspiring to work as school teachers. This study aimed to investigate whether personal characteristics of the test takers were correlated with their performance on the New Teachers' Licensure Test. It was decided to focus on only one subject for the moment. Specialty 16 (Special Education – visual disability) was chosen because it offered a balanced proportion of male and female candidates.
There were approximately 8,300 candidates who took the test on three different occasions (in February, May and June). The candidates in February and June were male and the candidates in May were female. Linear mixed effects models were used to analyze the data.
The results gave somewhat conflicting messages. For the February examination, a practically small but statistically significant proportion of the variance was attributed to the university of graduation. There was also a practically small but statistically significant proportion of the total variance attributed to the interaction between the university of graduation and the performance of candidates on different sections of the examination. In contrast, for the June examination, only a practically very small and statistically non-significant proportion of the variance was attributed to the university of graduation.
For the May examination, the results showed that a practically small but statistically significant proportion of the variance was attributed to the university of graduation. There was also a practically small but statistically significant proportion of the total variance attributed to the interaction between the university of graduation and the performance of candidates on different sections of the examination.
The findings of the study are inconclusive. More research needs to be done using additional datasets covering other specialties, which usually attract larger numbers of candidates. The final section of the study offers some suggestions regarding the educational, psychometric and sociological dimensions of this issue.
 
Determinants of success on the New Teachers' Licensure Test
 
Iasonas Lamprianou
 
 
 
 
 
This report extends the work done in September 2013 by the same author in a similar report, which investigated the performance of test takers on the New Teachers' Licensure Test. The title of that original report was “The Relationship Between Test Takers’ Characteristics and Their Performance on the New Teachers' Licensure Test: The Case of Specialty 16 (Special Education)”. That report was exploratory in nature and limited in scope because it dealt with only one subject (Special Education) and investigated the performance of test takers on three different testing occasions.
It was important to carry out this extension study because of the prominence of the New Teachers' Licensure Test, which is used by the National Center for Assessment in Higher Education as a licensure test for individuals aspiring to work as school teachers. The current study aimed to investigate whether personal characteristics of the test takers (as well as other background variables) were correlated with candidates’ performance on the New Teachers' Licensure Test.
Through increasingly complex models, a number of research questions were investigated regarding the probability that individual candidates pass the examination. The major findings may be summarized as follows:
 
·         The university of graduation has a significant effect on the probability of success.
 
·         Pre-examination training has a differential effect for different universities and specialties.
 
·         The year of graduation has an inconsistent effect on the probability of individual candidates to succeed. In most cases, candidates who graduated more recently had a lower probability of success.
 
In order to interpret these results appropriately, further investigation and deep knowledge of the context of the educational and examination system in the Kingdom of Saudi Arabia are needed.