Research and Studies
Scientific research has accompanied test development at the National Center for Assessment from the outset: the preparation of every test is followed by a battery of analyses at the item level and at the whole-test level, to establish the psychometric properties at each of the two levels and to describe examinee performance overall or by demographic, educational, and other characteristics. The parties concerned, inside and outside the Center, are provided with recommendations based on the results of these analyses, and the Center undertakes, for its part, the continuous improvement of its products that those results call for. Many of these studies are action research.
Other studies aim, beyond solving the problems they address, to contribute to the accumulation of knowledge in their fields, and their findings are of interest to the measurement community and to the general public. On this page we will publish, as they pass peer review, studies of this kind carried out at the Center's commission by its researchers or by outside collaborators, in Arabic and in English.
 
Technical Reports:
 
Reliability in Measurement: Unilevel and Multilevel Approaches
 
Georgios D. Sideridis

تحميل | Download
The purpose of the present paper was to evaluate the internal consistency reliability of the General Aptitude Test assuming clustered and non-clustered data, using commercial software (Mplus). Participants were 2,000 testees selected by random sampling from a larger pool of more than 65,000 examinees. The measure involved four factors, namely (a) planning for learning, (b) promoting learning, (c) supporting learning, and (d) professional responsibilities, and was hypothesized to form a unidimensional instrument assessing generalized skills and competencies. Intra-class correlation coefficients and variance ratio statistics suggested the need to incorporate a clustering variable (i.e., university) when evaluating the factor structure of the measure. Results indicated that single-level reliability estimation significantly overestimated the reliability observed across persons and underestimated the reliability at the clustering variable (university). Single-level reliability was also, at times, lower than the lowest acceptable levels, leading to a conclusion of unreliability, whereas multilevel reliability was low at the between-person level but excellent at the between-university level. It is concluded that ignoring nesting is associated with distorted and erroneous estimates of the internal consistency reliability of an ability measure, and that the use of Multilevel Confirmatory Factor Analysis (MCFA) is imperative to account for dependencies between levels of analysis.
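To make the clustering check concrete, the sketch below shows how the preliminary diagnostics described above can be run outside Mplus: an ICC(1) for total scores by university, next to the usual single-level coefficient alpha. This is a minimal illustration, not the study's code; the file and column names (item1..item4, university) are hypothetical.

```python
import numpy as np
import pandas as pd

def icc1(total: pd.Series, cluster: pd.Series) -> float:
    """One-way ANOVA intraclass correlation ICC(1): share of score variance between clusters."""
    g = total.groupby(cluster)
    k_bar = g.size().mean()                               # average cluster size
    grand = total.mean()
    ms_b = (g.size() * (g.mean() - grand) ** 2).sum() / (g.ngroups - 1)
    ms_w = sum(((x - x.mean()) ** 2).sum() for _, x in g) / (len(total) - g.ngroups)
    return (ms_b - ms_w) / (ms_b + (k_bar - 1) * ms_w)

def alpha(items: pd.DataFrame) -> float:
    """Single-level coefficient alpha; inflated at the person level when ICC(1) is non-trivial."""
    k = items.shape[1]
    return k / (k - 1) * (1 - items.var(ddof=1).sum() / items.sum(axis=1).var(ddof=1))

# Hypothetical usage:
# df = pd.read_csv("gat_sample.csv")                      # item1..item4 + university
# items = df[["item1", "item2", "item3", "item4"]]
# print(icc1(items.sum(axis=1), df["university"]), alpha(items))
```

When ICC(1) is non-trivial, a multilevel CFA decomposes reliability into separate within-level and between-level coefficients instead of a single alpha-type value, which is the comparison the paper reports.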
 
A Comparison of IRT Theta Estimates and Delta Scores From the Perspective of Additive Conjoint Measurement
 
Benjamin W. Domingue & Dimiter M. Dimitrov

تحميل | Download
The National Center for Assessment (NCA) is piloting an approach to the scoring and equating of tests with binary items referred to as the delta-scoring method (Dimitrov, 2015). The purpose of this study was to evaluate the intervalness of the D scores obtained under the delta-scoring method, in comparison with the intervalness of “theta” scores obtained under the three-parameter logistic model in item response theory (IRT). The question of interest was which scores (D or theta) are more consistent with the axioms of additive conjoint measurement (ACM; Luce & Tukey, 1964). This question was addressed through the approach of ConjointChecks (Domingue, 2014), with the use of real data from a large-scale assessment at the NCA. This study provides evidence that the D scores produce fewer violations of the ordering axioms of ACM than do the theta scores.
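The ordering-axiom idea can be illustrated with a small sketch (this stands in for, and is far simpler than, the ConjointChecks machinery, which works with posterior distributions): tabulate proportions correct by score group and item, then count violations of row and column monotonicity.

```python
import numpy as np

def ordering_violations(P: np.ndarray, tol: float = 0.0) -> tuple[int, int]:
    """P[g, i]: proportion correct for score group g (ascending ability) on item i (easiest first)."""
    row = int((np.diff(P, axis=0) < -tol).sum())   # ability goes up, p(correct) should not drop
    col = int((np.diff(P, axis=1) > tol).sum())    # difficulty goes up, p(correct) should not rise
    return row, col

# Toy check: a Rasch-conforming probability matrix produces no violations.
theta = np.linspace(-2, 2, 5)[:, None]
b = np.array([-1.0, 0.0, 1.0])[None, :]
P = 1 / (1 + np.exp(-(theta - b)))
print(ordering_violations(P))                      # (0, 0)
```

Counting such violations separately for groupings built from D scores and from theta scores is the spirit of the comparison reported above.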
Assessing Professional Knowledge of Teachers in Saudi Arabia: The Moderating Effects of Gender, Age, Year of Graduation, and Experience Using Multilevel Modeling
 
Georgios D. Sideridis
The purpose of the present study was to test the proposition that gender, age, and experience exert differential effects on teachers’ professional knowledge, promotion of learning, support of learning, and professional responsibility. Participants were 45,732 teachers who took the measure during a one-time administration. Results from fitting a multilevel model with university as the clustering variable indicated that the gender × age and gender × experience interactions were significant, signaling the differential effects of both variables. Specifically for gender, females had significantly lower scores than males across most subscales of the instrument as a function of age; that is, as females grew older their scores were significantly lower. The opposite trend was observed for experience: females with more experience had a significantly more positive growth trajectory on those factors compared to males. The findings are discussed in light of the independence between the measures of age and experience.
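A minimal sketch of the kind of model described above, using statsmodels rather than the authors' software; the data are simulated and all variable names are hypothetical.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({
    "gender": rng.choice(["F", "M"], n),
    "age": rng.uniform(25, 60, n),
    "experience": rng.uniform(0, 30, n),
    "university": rng.choice([f"u{i}" for i in range(20)], n),   # clustering variable
})
df["score"] = 50 + 0.1 * df["age"] + 0.2 * df["experience"] + rng.normal(0, 5, n)

# Random intercept per university; gender x age and gender x experience interactions.
model = smf.mixedlm("score ~ gender * age + gender * experience",
                    data=df, groups=df["university"])
result = model.fit()
print(result.summary())          # significant interaction terms signal differential effects
```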
 
 
 
 
 
 
 
 
The Effects of Stem and No-Stem in a Listening Comprehension Task: Evidence from Two Different Databases
 
Georgios D. Sideridis
The purpose of the present study was to test the hypothesis that the introduction of a stem condition (presentation of questions in a visual form) in a listening comprehension task would aid performance and would thus result in superior performance compared to the absence of a stem. This assumption was based on the premise that working memory is heavily burdened in a listening comprehension task, having to absorb and retain a great deal of information, so that any condition that frees up cognitive resources should benefit the examinee. Results indicated no statistically significant differences between the stem and no-stem conditions across a series of tests using item response theory (IRT; Bock & Moustaki, 2007), suggesting that other factors may be more salient in predicting performance in listening comprehension. Test characteristic curves and test information functions were almost identical between the two conditions. Items for which stem/no-stem differences were observed were flagged using differential item functioning analysis. It is concluded that achievement in listening comprehension is complex and requires an orchestration of various skills and competencies that need to be investigated further.
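The comparison of test characteristic curves (TCCs) and test information functions (TIFs) mentioned above can be sketched as follows; the 3PL item parameters here are synthetic stand-ins, not the study's estimates.

```python
import numpy as np

def p3pl(theta, a, b, c):
    """3PL probability of a correct response; theta is a vector, a/b/c are item vectors."""
    return c + (1 - c) / (1 + np.exp(-a * (theta[:, None] - b)))

def tcc_and_tif(theta, a, b, c):
    P = p3pl(theta, a, b, c)
    tcc = P.sum(axis=1)                                  # expected number-correct score
    info = a**2 * ((P - c) / (1 - c))**2 * (1 - P) / P   # standard 3PL item information
    return tcc, info.sum(axis=1)

theta = np.linspace(-3, 3, 61)
a = np.array([1.2, 0.8, 1.5]); b = np.array([-0.5, 0.0, 0.7]); c = np.full(3, 0.2)
tcc_stem, tif_stem = tcc_and_tif(theta, a, b, c)         # repeat with the no-stem item set
# Near-identical curve pairs across conditions correspond to the null result reported above.
```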
 
 
The Impact of Teacher Background Characteristics on Achievement in the General Section of the Teachers Test in Saudi Arabia
  
 
Khurrem Jehangir

تحميل | Download
The aim of this study is to investigate the relationship between teacher background characteristics and achievement on the general section of the Teachers Test. The background variables analyzed in this study are the examinees' previous university GPA, whether they attended a teacher college or a regular university, the nature of their enrollment in the educational institution (full-time or distance learning), the duration of any external training or coaching received before the exam, their age, and the time elapsed since graduation from university or college. The outcomes of this study can indicate how these background factors affect the achievement levels of examinees on the general section of the Teachers Test. The analyses are conducted for each of the above-mentioned variables individually, to see the independent effects, and jointly, to see the conditional effects after accounting for the other variables. The dataset consisted of 168,597 teachers who took the measure during four administrations. The test outcomes across the four administrations used in this study were equated by the NCA prior to the analysis undertaken for this report, so that meaningful comparisons could be made. The details of the equating methodology are beyond the scope of this report and may be obtained from the Professional Testing Department of the NCA. The dependent variable in the regression models employed in this study was the equated total score. The instrument, which relates to teacher evaluation of their knowledge and skills, comprised four factors, namely (a) Professional Knowledge, (b) Promoting Learning, (c) Supporting Learning, and (d) Professional Responsibility. More information about the instrument is available from the Professional Testing Department at the NCA; further details regarding its development, its link to theory, and its measurement properties were not available to the author.

Multilevel models with university as the clustering variable were used. The results indicated that all the selected variables affected test scores, and there were also significant interaction effects between certain variables, resulting in a differential impact of these variables across the different examinee subgroups.
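The individual-versus-joint analysis described above comes down to comparing marginal and conditional regression coefficients. A simplified sketch with simulated data (plain OLS shown in place of the full multilevel specification; all variable names are hypothetical):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 1000
df = pd.DataFrame({
    "gpa": rng.uniform(2, 5, n),                       # previous university GPA
    "teacher_college": rng.integers(0, 2, n),          # 1 = teacher college, 0 = regular university
    "distance_learning": rng.integers(0, 2, n),        # nature of enrollment
    "years_since_grad": rng.uniform(0, 15, n),
})
df["equated_score"] = 40 + 5 * df["gpa"] - 2 * df["distance_learning"] + rng.normal(0, 6, n)

# Independent effect of one background variable ...
print(smf.ols("equated_score ~ gpa", df).fit().params["gpa"])
# ... versus its conditional effect after accounting for the other variables.
print(smf.ols("equated_score ~ gpa + teacher_college + distance_learning + years_since_grad",
              df).fit().params["gpa"])
```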
Examining the prevalence and impact of non-attempted items in NCA educational tests
Iasonas Lamprianou
The aim of this study is to examine whether NCA achievement test scores are affected by response strategy decisions. The dataset consisted of the responses of 34,500 examinees to 52 verbal and 44 quantitative items.
It was found that the frequency of missing responses in the data was very small for both the Verbal and the Quantitative tests. Examinees who produced missing responses on one test also tended to produce missing responses on the other. Coding the missing responses as missing rather than as incorrect affected neither the model-data fit of the Rasch models nor the difficulty estimates of the items, and it did not noticeably affect the ability estimates of the overall sample. No evidence could be found that examinees of lower ability tended to produce more missing responses.
It is suggested that, although the phenomenon is not important enough to cause concerns regarding the validity and the reliability of the examination results, it should be monitored regularly.
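A minimal illustration of the coding comparison (the report itself refitted full Rasch models): with rare missingness, item difficulty on the logit scale barely moves between the two codings. The data here are simulated.

```python
import numpy as np

rng = np.random.default_rng(2)
resp = rng.integers(0, 2, size=(500, 20)).astype(float)
resp[rng.random(resp.shape) < 0.01] = np.nan                   # ~1% missing, as in "very small" rates

p_incorrect = np.nan_to_num(resp, nan=0.0).mean(axis=0)        # missing scored as wrong
p_missing = np.nanmean(resp, axis=0)                           # missing left out of the calculation

logit = lambda p: np.log(p / (1 - p))
print(np.abs(logit(p_incorrect) - logit(p_missing)).max())     # negligible when missingness is rare
```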
 
 
Modeling Group-Specific Differential Item Functioning in the STAPSOL Test
Khurrem Jehangir
Fit to item response theory (IRT) models in educational testing can be compromised by the presence of group-specific differential item functioning (GDIF). The current study proposes methods to detect GDIF and explores the feasibility of improving the fit of the measurement model by using group-specific item parameters to model GDIF. In this approach, it is assumed that a scale consists of both items that are free of GDIF and items with GDIF. The first set of items ensures the validity of the measure across groups; the second set is calibrated concurrently with the first, and both sets contribute to measurement precision. The procedure is used to model high-DIF items in the STAPSOL test of Arabic proficiency. Using data from the groups that participated in this exam, concurrent maximum marginal likelihood (MML) estimates of the parameters of the Rasch model (1PLM) are obtained. Then information on observed and expected response frequencies is used to identify GDIF items. Group-specific item parameters are introduced for the items with the largest GDIF effect sizes, and new MML estimates are obtained. The impact of using group-specific item parameters is evaluated by comparing the improvement in the fit of the IRT measurement model and by examining the means of the groups on the latent variables measured without and with a model for GDIF.
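A bare-bones sketch of the observed-versus-expected step described above, under the Rasch model; in the study the person and item parameters come from a concurrent MML calibration, whereas here everything is simulated.

```python
import numpy as np

def gdif_effect_sizes(resp, theta, b, groups):
    """Per-item mean of observed minus Rasch-expected correctness, within each group."""
    expected = 1 / (1 + np.exp(-(theta[:, None] - b[None, :])))
    return {g: (resp[groups == g] - expected[groups == g]).mean(axis=0)
            for g in np.unique(groups)}

# Simulated demo: residual means near zero; large |values| would flag candidate GDIF items,
# which then receive group-specific parameters before re-estimation, as in the study.
rng = np.random.default_rng(3)
theta = rng.normal(size=300); b = rng.normal(size=10)
resp = (rng.random((300, 10)) < 1 / (1 + np.exp(-(theta[:, None] - b)))).astype(float)
groups = rng.choice(["A", "B"], 300)
print(gdif_effect_sizes(resp, theta, b, groups))
```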
 
 
 
An Investigation of Dimensionality and Psychometric Properties of the Teacher Test-Mathematics
 
 
 
  
Dimiter M. Dimitrov & Abdulla AlSadaawi

تحميل | Download
The Teacher Test-Mathematics (TT-M) is administered by the National Center for Assessment in Higher Education as a large-scale assessment tool for teacher certification. This study investigates the dimensionality, reliability, differential item functioning (DIF) on gender, and other psychometric features of the test in the framework of confirmatory factor analysis (CFA) and item response theory (IRT). Classical statistics of TT-M items are also provided. The data come from the responses of 1,177 teachers to 50 dichotomously scored items (1 = correct, 0 = incorrect) on a base test form of the TT-M (Form 704). The CFA results showed that the TT-M data are essentially unidimensional; that is, there is one dominant dimension (ability, trait) underlying the scores of the examinees on the test.
The reliability for the TT-M data, estimated with the use of latent variable modeling, is 0.803, with 95% CI = (0.772, 0.834), which is adequate for the purpose of this study. The IRT results indicated that the overall fit of TT-M items to the three-parameter logistic model is adequate. It was also found that the match between the distribution of examinee ability and the distribution of item difficulty is not fully adequate: more items are needed in an interval of ability scores above the average, which is important for decisions on the certification of teachers based on their TT-M performance.
The results related to DIF on gender indicated that attention should be paid to four items that exhibit DIF, with two items signaled for DIF against females and two items signaled for DIF against males. While these four items deserve attention in subsequent revisions of the test, there is no particular concern about the overall DIF effect for the entire test, owing to the cancellation of positive and negative discrepancy magnitudes across items.
The conclusion is that the results in this technical report can be useful to researchers at the NCA in using and interpreting data from the TT-M and in its further development and improvement in terms of psychometric features and validity.
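The report does not spell out its DIF statistic; as one standard illustration of how gender DIF on binary items can be screened, here is the Mantel-Haenszel odds ratio, stratified on total score (a sketch, not necessarily the study's procedure).

```python
import numpy as np

def mantel_haenszel_or(item, group, total):
    """MH odds ratio for one binary item; group: 0 = reference, 1 = focal."""
    num = den = 0.0
    for s in np.unique(total):
        m = total == s                                  # one stratum per total-score value
        a = ((item == 1) & (group == 0) & m).sum()      # reference correct
        b = ((item == 0) & (group == 0) & m).sum()      # reference incorrect
        c = ((item == 1) & (group == 1) & m).sum()      # focal correct
        d = ((item == 0) & (group == 1) & m).sum()      # focal incorrect
        n = m.sum()
        num += a * d / n
        den += b * c / n
    return num / den
```

An odds ratio near 1 indicates no DIF; the ETS classification converts it to the delta scale as −2.35·ln(OR).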
 
Factorial Invariance and Latent Mean Differences of Scores on SAAT across Gender 
Ioannis Tsaousis
This study examined the factorial structure of data on the Standard Achievement Admission Test (SAAT), which is administered by the National Center for Assessment in Higher Education (NCA) at Riyadh, Saudi Arabia. A total of 63,380 individuals participated in this study; of these, 36,277 (57.2%) were males and 27,041 (42.7%) were females, while 62 participants (0.1%) did not report their gender. We examined three models: (a) a one-factor model, where all indicators load on one general achievement factor; (b) a four-factor model, with four latent constructs describing the variability among the SAAT items; and (c) a bifactor model, where all indicators are modeled to load simultaneously onto one general achievement factor as well as onto their corresponding sub-scales. The results showed that although all models fit well, the bifactor model fitted the data better than the alternative models. The results also revealed that the factor loadings and the item intercepts of the SAAT were invariant across gender. Finally, the third aim of the study was to test possible gender differences on the SAAT latent constructs. Using the latent mean difference procedure, we found that females scored higher than males on the Biology, Chemistry, and global achievement domains, while males scored higher than females on the Physics and Mathematics domains. Future studies to enhance the validity of the SAAT are discussed.
 
New Teacher Test: Factor Structure and Reliability
 
 
 
 
Dimiter M. Dimitrov & Abdullah AlSadaawi
This study investigates the dimensionality and factor structure of data on the Teacher Test (TT), administered by the National Center for Assessment in Higher Education as an assessment tool for teacher certification. The score reliability of the TT is also estimated. The data consist of the scores of 65,496 examinees on 79 multiple-choice items, of which 67 are operational test items and 12 are external items. The operational items are associated with four test domains: Professional Knowledge (34 items), Promoting Learning (13 items), Supporting Learning (15 items), and Professional Responsibilities (5 items). The results from the comparison of rival confirmatory factor models show that the test data are essentially unidimensional, with a reliability of .873. The results also suggest that the four test domains do not represent unique aspects of a general factor but, instead, can be treated as domain-related components of a single factor that underlies the test data. In addition, the results from the item response theory (IRT) calibration of test items indicate that, with a few exceptions, the items fit the three-parameter logistic model and that there is a good match between the distributions of examinees’ abilities and item difficulties. Also, the IRT ability scores and the raw (number-correct) scores of the examinees are normally distributed. The results in this report can be useful for valid interpretations of TT scores and further refinement of the test.
The Role of the Department of Study and Major on Teacher’s Knowledge and Attitudes
 
Georgios Sideridis
The purpose of the present study was to evaluate differences due to specialty in teachers’ attitudes, knowledge, and skills. Participants were 44,692 teachers from various specialties who completed an attitudinal and knowledge self-report measure. The effects of age, type of major, and experience were evaluated using hierarchical linear modeling techniques. Results indicated a noticeable trend for science majors to have more positive attitudes compared to Art/Education majors. Furthermore, the overall null effect of age was qualified by the finding that increases in age were associated with more positive attitudes for science majors only. The findings have implications for the incentives given to various age and gender groups in the education profession, as the trajectories of change in attitudes over time are not necessarily positive across disciplines.
Identifying Content and Cognitive Dimensions of the Standardized Test of English Proficiency (STEP)
 
 
 
Georgios Sideridis
The purpose of the present literature review is to provide information on the theoretical background behind the development of the STEP language instrument. According to the developers of the STEP, the instrument has been heavily influenced by two contemporary theoretical frameworks: one originating in the United States, the American Council on the Teaching of Foreign Languages (ACTFL) guidelines, and one originating in Europe, the Common European Framework of Reference (CEFR) for languages. A brief description of the two schemes is given, with an emphasis on the European system (i.e., the CEFR), as it has been more influential in the development of the STEP instrument. The description of the two language guideline systems is followed by a critical analysis of both, which ends with a set of recommendations and future directions.
The correlation of Item-writer Errors  
 
Iasonas Lamprianou 
 
 
 
 
 
The quality of examination tests is one of the major concerns of examination boards and testing services around the world. As Osterlind (1989) suggested, “constructing test items for standardized tests of achievement, ability, and aptitude is a task of enormous importance—and one fraught with difficulty. The task is important because test items are the foundation of written tests...” (p. 1). Test items, however, are usually prepared by item writers who may be considered experts on the subject matter and who have also received training in test construction. The extent to which item writers make errors during item-writing exercises has been surprisingly neglected as a research topic in the literature.
Using Confirmatory Factor Analysis in Testing for the Reliability and Validity of General Aptitude Test (GAT) Scores for Postgraduate Students
 
 
Ioannis Tsaousis
This study presents data regarding the reliability and the internal structure of the General Aptitude Test for Postgraduate Students (GAT-Post). To estimate internal consistency at both the domain and the sub-scale level, we used the omega (ω) index as the method of estimation, since alpha could not be applied: the assumption of tau (τ) equivalence was violated in our data. We found that, at the domain level, both GAT-Post scales (i.e., verbal and quantitative) had acceptable internal consistency indices. At the sub-scale level, the vast majority of the sub-scales also showed acceptable values; only two sub-scales, “Comparison” and “Critical Thinking”, had omega indices below the minimum accepted value. We also examined the internal structure and convergent validity of the GAT-Post by applying a protocol suggested by Gorsuch (1983), in which fixing specific parameters (e.g., covariances, factor loadings) allows different types of validity (e.g., content, convergent) to be tested. By applying this method, we provided evidence to support the validity of the test. We conclude that the current study provides important information regarding the reliability and the dimensionality of the GAT-Post. We also suggest that some items should be carefully examined by NCA experts to decide whether they should be removed from the test without jeopardizing the content representation of the sub-scales that compose the test.
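For reference, the omega index used above is computed from the loadings λᵢ and residual variances ψᵢ of a single-factor model (the standard formula, stated here for orientation rather than quoted from the report):

```latex
\omega \;=\; \frac{\left(\sum_{i=1}^{k}\lambda_i\right)^{2}}
               {\left(\sum_{i=1}^{k}\lambda_i\right)^{2}+\sum_{i=1}^{k}\psi_i}
```

Alpha coincides with omega only when all loadings are equal (tau-equivalence), which is precisely the assumption reported as violated here.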
 
A Perspective on the State of Secondary Education and Its Development
 
Dr. Faisal bin Abdullah Al-Mishari Al Saud
Education in general, and secondary education in particular, has been the object of many attempts at change and development, and at the same time of much criticism and disparagement. These criticisms were perhaps the fuel that drove the attempts at change and the search for solutions.
In my view, many features of the changes were borrowed from foreign education systems, in an attempt to benefit from their sometimes apparent advantages, in the belief that they would not exist were they not better than what we have, or because others preceded us in this field.
The borrowing of foreign systems and their application to the Saudi environment was often accompanied by neglect of factors essential to the success of such borrowing, among them: adapting these systems to the local environment, and attending to their substance rather than their surface features. We also frequently overlook the conditions of their implementation, which are necessarily the conditions of their success. And we often err when we adopt a system before investigating the available alternatives and selecting the most suitable among them.
 
 
Assessing the Academic Performance at Pre-College, College, and Post-College Levels for Universities in Saudi Arabia
 
 
 
Dimiter Dimitrov
Khaleel Al-Harbi
Purpose: The goal is to illustrate an approach to assessing the academic performance of college students in Saudi Arabia on (a) prior-to-college admission tests, (b) college GPA and coursework grades, and (c) post-college outcomes such as teacher licensure tests.
Method: Data on pre- and post-college measures are available for a nationally representative sample across universities in Saudi Arabia. For a targeted college from this sample, (a) the performance on pre- and post-college tests is compared to that of other colleges with the same profile of major, (b) college GPA and coursework grades are analyzed for problematic aspects of student performance across semesters, and (c) pre-college admission tests are investigated for predictive validity on college grades and post-college professional outcomes.
Results: With data for a specific college, the results indicated that (a) the college performs at the national average on pre-college tests and on post-college teacher licensure and aptitude tests, (b) the admission tests are valid predictors of college success and post-college outcomes, with unique contributions from some pre-college measures, and (c) there are some sharp fluctuations in GPA profiles across semesters and areas of major that need investigation for problems associated with curricula, prerequisites for courses, and so forth.
Examining Students’ Performance on Pre-College Tests (GAT and SAAT), College GPA, and Post-College Tests (PGAT and Teacher Test): The Case of Science Colleges at Saudi Universities
 
 
Dimiter Dimitrov
Khaleel Al-Harbi
The purpose of this study was twofold: first, to examine students’ performance within a targeted College of Science on the pre-college admission tests (GAT and SAAT), together with their GPA profiles across semesters and course grades in the major; second, to compare the College of Science with other colleges that have the same major areas (Computer Science & Information, Physics, and Math) on the pre-college tests GAT and SAAT, as well as on the post-college tests PGAT and Teacher Test. The results indicated that the students performed slightly above the average scale norm on GAT but substantially below the average on SAAT. The GPA profiles of the students varied across semesters and majors (Computer Science & Information, Physics, and Math). On post-college assessments, the students performed slightly above the average on PGAT, but substantially below the average scale norm on the Teacher Test. Furthermore, the NCA admission tests (GAT and SAAT) predict the students’ performance on the Teacher Test well, but the overall GPA on college coursework to a lesser degree. The comparison of the targeted College of Science with similar colleges from other universities showed no mean differences, but the students from the targeted College of Science are represented at a higher rate in the extreme categories (low and high) of the five performance levels on GAT, SAAT, PGAT, and the Teacher Test. Beyond the usefulness of the findings for the administration and educators of the targeted College of Science, the study design and methodology can be used by other institutions in efforts to improve curricula and the quality of student learning.
 
 
 
Content Analysis of a Sample of Articles Written About the National Center for Assessment and Evaluation in Higher Education in Saudi Newspapers
 
Dr. Dakheel Al-Dakheel Allah
The present study aimed to identify the manifest and latent content of a sample of articles published in the major Saudi newspapers about the National Center for Assessment (Qiyas), including the topics, aims, and motives of writing about Qiyas, as well as the manner of writing, the author's linguistic style, and the characteristics of the context in which the articles appeared. Eighty-four articles published in nine Saudi newspapers were analyzed, distributed in varying proportions across the main administrative regions of Saudi Arabia (Central, Western, Eastern, and Southern). The researcher used content analysis, in both its manifest and latent forms, to answer the study's questions and attain its objectives. The authors of the sampled articles comprised 68 males and 16 females. The educational level of the authors for whom data were available reached the university level at most, with 23 of them holding doctorates; no data were available on the educational levels of the remaining authors. The authors' connection to the topics of their articles, and the relevance of their professional backgrounds to those topics, varied, though on the whole they tended to be remote from the topic. The content analysis showed that the aim of the majority of the articles was to advance claims, with weak evidence underpinning those claims. Some articles aimed to discuss a measurement topic, but discussion was not the primary aim of the majority. The analysis of the manifest content revealed a marked deficiency in knowledge of measurement and of the National Center for Assessment. Coarseness of language was limited in most of the articles but clearly present in a few. The analysis also revealed that appreciation of other points of view on the topic was rare among the authors. The latent content analysis, in turn, revealed a scarcity of meaningful and significant ideas, together with a reasonable level of seriousness, and the entanglement of some authors in a number of logical fallacies. It also revealed the predominance of negativity in the substance and nature of the context of the sampled articles, a result consistent with the analysis of the articles' titles, in which the authors' negative attitudes toward their topics outweighed the positive. From the manifest and latent content analyses the researcher concluded that the results at the two levels converge, and this convergence suggests an attitude toward measurement and the National Center for Assessment that appears, for the most part, negative. The researcher closed the study with a list of theoretical and practical conclusions based on the study's findings, each followed by a recommendation built upon it. The conclusions and recommendations centered on the need to enforce the rules and standards of press publication, to develop newspaper writers' basic skills in writing, analysis, and objective criticism, and to attend to training in the etiquette of dialogue and disagreement and the art of dealing with people.
 
Test Length and Precision: How Precise Will the GAT Be if the Test Is Shortened?
 
Dr. Abdullah Al-Qataee
Dr. Abdulrahman Al-Shamrani
All the evidence gathered showed a promising investment in short-form tests. Reliability evidence, the standard error of measurement, and correlations with other forms all support the use of the short form.
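For orientation, the canonical relation between test length and reliability (not quoted from the report) is the Spearman-Brown prophecy formula: shortening a test with reliability ρ by a length factor k gives

```latex
\rho^{*}=\frac{k\,\rho}{1+(k-1)\,\rho}
```

so halving (k = 0.5) a form with ρ = .90 predicts ρ* ≈ .82, which is the kind of trade-off the evidence above speaks to.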
 
 
Using the Rasch Model to Examine the Psychometric Properties of the General Aptitude Test for Postgraduate Students (GAT-Post)
 
Ioannis Tsaousis
 
 
 
The aim of this study was to evaluate the psychometric characteristics of the General Aptitude Test for Postgraduate Students (GAT-Post). For this purpose, we applied the Rasch model to examine the dimensionality of each of the sub-scales of the test, person and item misfit, and person and item reliability. Furthermore, we examined the amount of information yielded by the items that compose each sub-scale of the test. In terms of dimensionality, Principal Components Analysis (PCA) of the standardized residuals showed that all the GAT-Post sub-scales are unidimensional, although in most cases the amount of variance explained by the Rasch dimension was rather low. With regard to reliability, it was found that at the person level the index was rather low, which suggests that more items should be added at the two ends of the ability scale (high and low ability). This finding was also supported by other criteria, such as the item-person map and the Test Information Function (TIF). Finally, using different item fit statistics (e.g., infit and outfit), several items were identified as potentially misfitting the scale. We suggest a careful re-examination of all of these items, and we recommend that any of lesser importance could potentially be eliminated.
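For reference, the infit and outfit statistics mentioned above are the standard Rasch mean-square fit statistics, built from standardized residuals z_{ni} = (x_{ni} − E_{ni})/√(W_{ni}), where W_{ni} is the model variance of response x_{ni}:

```latex
\text{Outfit}_i=\frac{1}{N}\sum_{n=1}^{N} z_{ni}^{2},
\qquad
\text{Infit}_i=\frac{\sum_{n=1}^{N} W_{ni}\,z_{ni}^{2}}{\sum_{n=1}^{N} W_{ni}}
```

Values near 1 indicate adequate fit; marked departures in either direction flag an item (or person) as potentially misfitting.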
Standardized Test of Arabic Proficiency in Speakers of Other Languages (STAPSOL): Evidence of Its Reliability and Validity
 
Dr. Amjed Al-Owidha
Dr. Abdulrahman Al-Shamrani
In response to requests from some academic institutions, the National Center for Assessment in Higher Education has developed a Standardized Test of Arabic Proficiency in Speakers of Other Languages (STAPSOL). The test follows a strict methodology in the way its items are written, reviewed, trialed, maintained, and secured.
The test has four components: Reading Comprehension (RC), Structure (ST), and Listening Comprehension (LC), as well as non-scorable trial items. A free writing task is also included.
In order to meet internationally recognized standards, the test is built to ensure reliability, validity, and fairness with respect to sex, region, and level of study. For maximum quality assurance, the test items are written and reviewed by native speakers who are specialists in Arabic, applied linguistics, and measurement, and the listening section is recorded exclusively by native speakers of Arabic. STAPSOL is also in the process of being formally linked to the Common European Framework of Reference for Languages (CEFR).
A sample of 1,621 examinees enrolled in different institutions in Jakarta, Indonesia, took the test. The findings indicated that the test is reliable, and initial construct validity evidence supports the validity of the test. However, more evidence of the test's validity still needs to be collected.
 
 
Factorial Invariance and Latent Mean Differences of Scores on GAT across Gender
Ioannis Tsaousis
This study examined the factorial structure of data on the General Aptitude Test administered by the National Center for Assessment in Higher Education (NCA) at Riyadh, Saudi Arabia. A total of 8,272 individuals participated in this study; of them, 7,409 (89.6%) were females and 863 (10.4%) were males. We examined two models: a second-order (hierarchical) model with seven latent variables as a function of two higher-order cognitive factors (i.e., verbal and numerical), which is the factor structure of GAT suggested by NCA experts, and a two-factor model with two latent variables (i.e., verbal and numerical) whose indicators consist of the seven cognitive sub-scales. The results showed that both models fit the data very satisfactorily, providing evidence for the structural validity of the GAT. The second aim of this study was to test the measurement invariance (i.e., configural, metric, and scalar invariance) of both examined models; the results showed that both models exhibit configural, metric, and scalar invariance. Finally, the third aim of the study was to test possible gender differences on the scales and sub-scales. Using the latent mean difference procedure, we found that males scored higher than females on the numerical domain, while females scored higher than males on the verbal domain. At the sub-scale level, the analysis showed that females scored significantly higher than males on the Word Meaning, Sentence Completion, and Analogy latent domains, while males scored significantly higher than females on the Arithmetic and Geometry latent domains. There were no statistically significant differences between males and females on the Reading Comprehension and Analysis latent domains.
 
DIF Analysis for Item and Test on NCA Tests: The General Ability Test (GAT), Art Major
 
Georgios Sideridis
Ioannis Tsaousis
 
 
 
The aim of this research project is to investigate the presence of bias in the General Aptitude Test (GAT) subscales. Evidence for bias is provided by use of the differential item functioning (DIF) and differential test functioning (DTF) procedures in item response theory (IRT) models, along with measurement invariance testing using the multi-group confirmatory factor analysis (MGCFA) technique. Three forms of bias were assessed at the item level: (a) uniform DIF, (b) non-uniform DIF, and (c) item-latent factor invariance. Item bias involves differential item functioning in item difficulties and/or item discriminations across gender, school type, provinces, regions, and test forms. Furthermore, bias was assessed at the test level by means of DTF, a procedure analogous to DIF that, instead of examining between-group differences in item characteristic curves (ICCs), does so at the test level by examining the test characteristic curves (TCCs).
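The uniform/non-uniform distinction can be made concrete with the logistic-regression DIF procedure (Swaminathan & Rogers, 1990), shown here as an illustration with simulated data rather than the study's IRT/MGCFA implementation:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(4)
n = 2000
df = pd.DataFrame({"total": rng.normal(0, 1, n),        # matching criterion (total score)
                   "focal": rng.integers(0, 2, n)})     # group indicator
df["item"] = (rng.random(n) < 1 / (1 + np.exp(-(df["total"] + 0.3 * df["focal"])))).astype(int)

m0 = smf.logit("item ~ total", df).fit(disp=0)          # no DIF
m1 = smf.logit("item ~ total + focal", df).fit(disp=0)  # adds uniform DIF
m2 = smf.logit("item ~ total * focal", df).fit(disp=0)  # adds non-uniform DIF
print(m1.llf - m0.llf, m2.llf - m1.llf)
```

Twice the log-likelihood step from m0 to m1 tests uniform DIF, and from m1 to m2 non-uniform DIF, each against a χ²(1) reference distribution.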
 
Factorial Structure of GAT and its Measurement Invariance across School Types
 
 
 
Ioannis Tsaousis
 
 
The aim of this study was threefold. First, we investigated the factorial structure of the General Aptitude Test (GAT) administered by the National Center for Assessment in Higher Education (NCA) at Riyadh, Saudi Arabia. A total of 8,272 individuals participated in this study. We examined two models: a second-order (hierarchical) model with seven latent variables as a function of two higher-order cognitive factors (i.e., verbal and numerical), which is the factor structure of GAT suggested by the NCA experts, and a bifactor model, where all indicators are modeled to load simultaneously onto one general cognitive factor as well as onto their corresponding sub-scales. The results showed that although both models showed adequate fit, the second-order model fits the data better than the bifactor model. This study also builds on existing validity support for the GAT by evaluating its factorial invariance across different types of schools (i.e., public vs. private schools) in Saudi Arabia; the results showed that the second-order model exhibits configural, metric, and scalar invariance. Finally, the third aim of the study was to test possible school-type differences on GAT's scales and sub-scales. Using the latent mean difference procedure, we found that students from private schools scored higher than students from public schools on both GAT latent domains (i.e., verbal and numerical). At the sub-scale level, the analysis showed that students from private schools scored higher than students from public schools on all of GAT's latent domains, except for the Arithmetic and Geometry latent domains, where there was no statistically significant difference between the two types of school.
 
DIF Analysis for Item and Test on the NCA Tests: The General Ability Test (GAT), Science Major
 
Georgios Sideridis
Ioannis Tsaousis
 
 
 
 
The aim of this research project is to investigate the presence of bias in the General Aptitude Test (GAT) subscales. Evidence for bias is provided by use of the differential item functioning (DIF) and differential test functioning (DTF) procedures in item response theory (IRT) models, along with measurement invariance testing using the multi-group confirmatory factor analysis (MGCFA) technique. Three forms of bias were assessed at the item level: (a) uniform DIF, (b) non-uniform DIF, and (c) item-latent factor invariance. Item bias involves differential item functioning in item difficulties and/or item discriminations across gender, school type, regions, and test forms. Furthermore, bias was assessed at the test level by means of DTF, a procedure analogous to DIF that, instead of examining between-group differences in item characteristic curves (ICCs), does so at the test level by examining the test characteristic curves (TCCs).
 
GAT-Verbal: Testing for Dimensionality and Validation of Factorial Structure
 
Dimiter M. Dimitrov
 
 
 
This study examines the dimensionality and factorial structure of data on the General Aptitude Test-Verbal Part (GAT-Verbal) administered by the National Center for Assessment in Higher Education (NCA) at Riyadh, Saudi Arabia. The data consist of binary scores of 15,610 students on 65 multiple-choice items grouped into three content-specific domains: Analogy (21 items), Sentence Completion (17 items), and Reading Comprehension (27 items). Testing for dimensionality and factorial structure of GAT-Verbal data was conducted in the framework of confirmatory factor analysis (CFA). The best data fit was found for a bifactor model of the GAT-Verbal factor structure, with one general factor of verbal aptitude and the three content-specific domains as latent aspects of the verbal aptitude. These results provide support for the essential unidimensionality of the test scores and the appropriateness of scoring the student responses on each of the three content-specific domains (Analogy, Sentence Completion, and Reading Comprehension). Five (out of 65) items were identified for misfit to the factorial structure of GAT-Verbal, namely one item from Analogy, one item from Sentence Completion, and three items from Reading Comprehension. An important practical implication of these results is that it is appropriate to use unidimensional IRT calibration of GAT-Verbal and to scale students’ performance on the content-specific domains of the test.
 
 
 
Latent Class Analysis of GAT-Quantitative Data
 
Dimiter M. Dimitrov
 
 
 
This study deals with latent class analysis (LCA) of data on the General Aptitude Test-Quantitative Part (GAT-Quantitative) administered by the National Center for Assessment in Higher Education (NCA) at Riyadh, Saudi Arabia. The data consist of the binary scores of 15,610 students on 55 multiple-choice items grouped into five content-specific domains: Arithmetic (20 items), Geometry (12 items), Comparison (8 items), Analysis (10 items), and Algebra (5 items). The goal was to identify latent classes of students who took GAT-Quantitative based on their probabilities of answering the test items correctly. The results from the LCA, performed with the computer program Mplus (Muthén & Muthén, 2010), led to the identification of the following latent classes of students by content-specific domain: (a) six latent classes for Arithmetic, (b) four latent classes for Geometry, (c) five latent classes for Comparison, (d) three latent classes for Analysis, and (e) three latent classes for Algebra. Detailed examination of these results can be very helpful in understanding the characteristics of GAT test-takers grouped into homogeneous classes by item performance, and the differentiated properties of test items across different classes of examinees.
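For readers who want the mechanics, below is a minimal EM sketch of latent class analysis for binary items with simulated data; the study itself used Mplus, and the class counts above were selected by comparing solutions, which this sketch does not do.

```python
import numpy as np

def lca_em(X, n_classes, n_iter=200, seed=0):
    """X: (n_persons, n_items) binary matrix. Returns class weights and item-response probabilities."""
    rng = np.random.default_rng(seed)
    n, k = X.shape
    pi = np.full(n_classes, 1 / n_classes)              # class proportions
    p = rng.uniform(0.25, 0.75, size=(n_classes, k))    # P(correct | class) per item
    for _ in range(n_iter):
        # E-step: posterior class membership from Bernoulli log-likelihoods
        logl = X @ np.log(p).T + (1 - X) @ np.log(1 - p).T + np.log(pi)
        post = np.exp(logl - logl.max(axis=1, keepdims=True))
        post /= post.sum(axis=1, keepdims=True)
        # M-step: update class proportions and item-response profiles
        pi = post.mean(axis=0)
        p = np.clip((post.T @ X) / post.sum(axis=0)[:, None], 1e-6, 1 - 1e-6)
    return pi, p

rng = np.random.default_rng(5)
X = rng.integers(0, 2, size=(500, 8)).astype(float)     # synthetic demo data
pi, p = lca_em(X, n_classes=3)
print(pi)
```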
 
 
 
Latent Class Analysis of GAT-Verbal Data
 
Dimiter M. Dimitrov
 
 
 
This study deals with latent class analysis (LCA) of data on the General Aptitude Test-Verbal Part (GAT-Verbal) administered by the National Center for Assessment in Higher Education (NCA) at Riyadh, Saudi Arabia. The data consist of the binary scores of 15,610 students on 65 multiple-choice items grouped into three content-specific domains: Analogy (21 items), Sentence Completion (18 items), and Reading Comprehension (26 items). The goal was to identify latent classes of students who took GAT-Verbal based on their probabilities of answering the test items correctly. The results from the LCA, performed with the computer program Mplus (Muthén & Muthén, 2010), led to the identification of the following latent classes of students by content-specific domain: (a) six latent classes for Analogy, (b) three latent classes for Sentence Completion, and (c) four latent classes for Reading Comprehension. Further examination of these results based on demographic information about students, schools, curricula, and regions can be very helpful in understanding the characteristics of GAT test-takers grouped into homogeneous classes by item performance, and the differentiated properties of test items across different classes of examinees.
 
 
 
IRT and True-Score Analysis of NCA Tests for Teaching Skills
 
Dimiter M. Dimitrov
 
 
 
Item response theory (IRT) analysis of NCA test data is in line with the contemporary approach to obtaining “sample-free” accurate estimates of both item parameters and the ability scores of examinees. Compared to classical true-score theory (CTT), IRT provides higher accuracy and flexibility in the evaluation of a person’s ability and of item characteristics. However, numerous studies support the argument that IRT and CTT may coexist in making adequate interpretations and decisions based on test data (e.g., Abedlazeez, 2010; Dimitrov, 2003a, 2003b; Fan, 2010; Hambleton, Swaminathan, & Rogers, 1991). Traditionally, CTT measures represent a common focal point between test developers and practitioners, as they place the scores and their accuracy on the original scale of measurement, say, number-right (NR) scores on a multiple-choice item (MCI) test. For example, Hambleton, Swaminathan, and Rogers (1991, p. 85) noted that the transformation of IRT ability scores to true (or domain) scores in CTT has important implications: (a) negative scores on the IRT logit scale are eliminated, (b) the transformed scale ranges from 0% to 100% if the domain score is used, which is readily interpretable, and (c) when pass-fail decisions must be made, a cutting score is typically set on the domain-score scale. Thus, combining IRT information about person and item parameters with readily interpretable true-score information can positively affect the quality of test development, analysis, and decision making.
The approach of combining IRT and CTT information in psychometric analysis of test data requires a better understanding of the relationships between IRT concepts and their true-score counterparts, from both technical and methodological perspectives. This approach is illustrated in this report by addressing the task of IRT and CTT analysis of data from the NCA test on Teaching Skills.
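The IRT-to-true-score transformation referred to above is the test characteristic function: for a test of n items with item response functions P_i(θ), the domain score is

```latex
\pi(\theta)=\frac{1}{n}\sum_{i=1}^{n}P_i(\theta)
```

which maps any ability θ to an expected proportion correct on the readily interpretable 0-100% scale, where certification cut scores are typically set.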
 
 
Testing for Unidimensionality of GAT Data
 
Dimiter M. Dimitrov
 
 
 
Testing for (uni)dimensionality of test data is of critical importance for the selection of an appropriate model for data analysis and validation in any measurement framework—classical true-score theory (CTT), item response theory (IRT), or confirmatory factor analysis (CFA). Without going into a review of existing approaches to testing for dimensionality, the approach used for this task is bifactor modeling in the framework of CFA, applied to data collected through the administration of the GAT assessment at the NCA. The rationale for this choice is based on arguments in the dimensionality research that the bifactor model (a) provides an evaluation of the distortion that may occur when unidimensional models are fit to multidimensional data, (b) allows researchers to evaluate the utility of forming subscales, and (c) provides an alternative to nonhierarchical multidimensional models for scaling individual differences (e.g., Chen, West, & Sousa, 2006; Reise, Morizot, & Hays, 2007).
The main question to be addressed with a bifactor model is whether the data are sufficiently unidimensional to apply a unidimensional IRT model without significant distortion in item parameters, or whether a multidimensional item response theory (MIRT) model is more appropriate. Also, if a unidimensional model is appropriate, are the domain-specific clusters of items substantial and reliable enough to allow for scoring the examinees on such domains? These questions are particularly relevant in the context of NCA tests, as the tests typically assume one general dimension (factor) that underlies the examinees’ performance together with several domain-specific subdomains.
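In equation form, the bifactor model used for this purpose writes each item response as a function of one general factor and one domain-specific factor (the standard formulation, given here for reference):

```latex
X_{ij}= \lambda_{Gj}\,\eta_{Gi}+\lambda_{Sj}\,\eta_{S(j)i}+\varepsilon_{ij},
\qquad \operatorname{Cov}\!\left(\eta_{G},\eta_{S}\right)=0
```

where η_G is the general factor, η_{S(j)} is the specific factor for the domain containing item j, and all factors are orthogonal. The data are "sufficiently unidimensional" when the specific loadings λ_{Sj} are small relative to the general loadings λ_{Gj}, and the subscales are worth scoring when the specific factors retain substantial reliable variance.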
 
 
 
 
 
Investigation of Potentially Exposed Items in the Context of the Verbal Section of the General Aptitude Test (GAT)
 
 
Iasonas Lamprianou
This study is an extension of an exploratory study which was conducted by the same author in September 2013 under the title “Investigation for Exposed Items in the Context of the General Aptitude Test (GAT)”. The original, exploratory study aimed to investigate whether a specific group of items of the quantitative section of the test may have been exposed to the candidates. However, the General Aptitude Test (GAT) consists of a verbal section as well.
This study replicates the models and procedures of the original study (a) to investigate whether a specific group of items on the verbal section of the test may have been exposed to the candidates; and (b) to identify the response patterns of examinees who may have benefitted from the allegedly “exposed” items on the verbal section of the test.
The results suggest that only a very small number of candidates may have benefitted from the allegedly “exposed” items. Unfortunately, demographic data are not available, so it is not possible to investigate whether those candidates belong to a specific social class or reside in specific areas.
 
 
The Relationship Between Test Takers' Characteristics and Their Performance on the New Teachers' Licensure Test: The Case of Specialty 16 (Special Education)
 
Iasonas Lamprianou
The New Teachers' Licensure Test is used by the National Center for Assessment in Higher Education as a licensure test for individuals aspiring to work as school teachers. This study aimed to investigate whether personal characteristics of the test takers were correlated with their performance on the New Teachers' Licensure Test. It was decided to focus on only one subject for the moment: Specialty 16 (Special Education – visual disability) was chosen because it offered a balanced proportion of male and female candidates.
There were approximately 8300 candidates who took the test on three different occasions (in February, May and June). The candidates in February and June were male and the candidates in May were female. Linear mixed effects models were used to analyze the data.
The results gave somewhat conflicting messages. For the February examination, a practically small but statistically significant variance was found to be attributed to the University of graduation. There was also a practically small but statistically significant proportion of the total variance attributed to the interaction between the University of graduation and the performance of candidates on different sections of the examination. On the contrary, for the June examination, there was only a practically very small and statistically non-significant proportion of the variance attributed to the University of graduation.
For the May examination, the results showed that there was a practically small but statistically significant variance attributed to the University of graduation. There was also a practically small but statistically significant proportion of the total variance attributed to the interaction between the University of graduation and the performance of candidates on different sections of the examination.
The findings of the study are inconclusive. More research needs to be done using additional datasets covering other specialties, which usually attract larger numbers of candidates. The final section of the study offers some suggestions regarding the educational, psychometric, and sociological dimensions of this issue.
 
 
Determinants of success on the New Teachers' Licensure Test
 
Iasonas Lamprianou
 
 
 
 
 
This report extends work done in September 2013 by the same author in a similar report that investigated the performance of test takers on the New Teachers' Licensure Test. The title of that original report was “The Relationship Between Test Takers’ Characteristics and Their Performance on the New Teachers' Licensure Test: The Case of Specialty 16 (Special Education)”. That report was exploratory in nature and limited in scope, because it dealt with only one subject (Special Education) and investigated the performance of test takers on three testing occasions.
It was important to carry out this extension study because of the prominence of the New Teachers' Licensure Test, which is used by the National Center for Assessment in Higher Education as a licensure test for individuals aspiring to work as school teachers. The current study aimed to investigate whether personal characteristics of the test takers (as well as other background variables) were correlated with candidates’ performance on the New Teachers' Licensure Test.
Through increasingly complex models, a number of research questions were investigated regarding the probability that individual candidates would pass the examination. The major findings may be summarized as follows:
 
· The university of graduation has a significant effect on the probability of success.
· Pre-examination training has a differential effect across universities and specialties.
· The year of graduation has an inconsistent effect on individual candidates' probability of success; in most cases, candidates who graduated more recently had a lower probability of success.
 
In order to interpret these results appropriately, further investigation and deep knowledge of the context of the educational and examination system in the Kingdom of Saudi Arabia are needed.
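The general form of the models behind these findings can be sketched as a logistic regression for the probability of passing; this is an illustration with simulated data and hypothetical variable names, not the report's specification (which also involved random effects and interactions).

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(6)
n = 3000
df = pd.DataFrame({
    "university": rng.choice([f"u{i}" for i in range(15)], n),
    "training_weeks": rng.integers(0, 9, n),       # pre-examination training
    "years_since_grad": rng.integers(0, 12, n),    # year of graduation, recoded
})
logit_p = -0.2 + 0.05 * df["training_weeks"] - 0.04 * df["years_since_grad"]
df["passed"] = (rng.random(n) < 1 / (1 + np.exp(-logit_p))).astype(int)

m = smf.logit("passed ~ C(university) + training_weeks + years_since_grad", df).fit(disp=0)
print(m.params.filter(like="training"))            # direction and size of the training effect
```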