The technical quality of Georgia assessments is evaluated to ensure that the steps followed during development result in valid and reliable scores. Georgia Milestones and the GAA 2.0 are scored using Item Response Theory (IRT), a framework that has been widely adopted for maintaining test score scales across the assessment industry, including by essentially all states. The process of scoring, that is, applying and evaluating a scoring model, involves three main steps: calibration, equating, and scaling. Calibration uses an IRT model to produce item difficulty estimates and person ability estimates. Even after consistent construction procedures are applied, assessments may still differ slightly in difficulty from one year to the next. For this reason, the assessments are also equated. Equating removes these differences in difficulty, so that the level of performance required to earn a given score and to reach the achievement levels on the tests is held constant. Thus, scale scores and achievement levels within a given grade and content area are comparable across administrations and years. Scaling applies the achievement standards set by Georgia educators to place person ability estimates on a scale with consistent achievement cut points.
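To illustrate the calibration and scaling steps described above, the sketch below shows how an IRT model relates item difficulty and student ability to the probability of a correct response, and how an ability estimate could then be placed on a reporting scale. It uses the one-parameter Rasch model and made-up slope and intercept values purely as an example; the operational models, parameters, and scale constants used for Georgia assessments are not specified here.

```python
import math

def rasch_probability(ability, difficulty):
    """Probability of a correct response under the Rasch (1PL) IRT model:
    P(correct) = exp(theta - b) / (1 + exp(theta - b))."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

def to_scale_score(ability, slope=45.0, intercept=525.0):
    """Place a person ability estimate (theta) on a reporting scale using a
    linear transformation; the slope and intercept here are hypothetical."""
    return round(slope * ability + intercept)

# A student slightly above average ability (theta = 0.5) answering an item
# of average difficulty (b = 0.0) has about a 62% chance of success.
theta, b = 0.5, 0.0
print(round(rasch_probability(theta, b), 2))  # 0.62
print(to_scale_score(theta))                  # 548 (hypothetical scale score)
```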
Some Georgia assessments do not use IRT and therefore follow a different scoring process from the one described above. For example, GKIDS 2.0 uses a rater-based scoring process in which teachers rate students' demonstration of mastery as achievement levels within a progression. Keenville uses an innovative game-based leveling approach based on the accuracy rate a student achieves. All Georgia assessments follow the same general scoring philosophy: students are given credit for the mastery of the academic content standards they have demonstrated, and credit is never taken away.
In addition to the review processes described in all prior steps, Georgia convenes a Technical Advisory Committee (TAC) composed of six nationally recognized experts in the field of educational measurement. The purpose of the TAC is to provide the GaDOE with impartial, expert guidance on the technical qualities of all state assessments. The TAC reviews every step of the assessment development, scoring, and reporting process for each assessment program.
The technical quality of Georgia's summative assessments is also externally evaluated by the U.S. Department of Education in a process known as peer review. During peer review, each state must submit detailed documentation providing evidence of the technical qualities of its assessment systems. This evidence covers test design and development, validity, reliability, fairness and accessibility, scoring, and technical analysis and ongoing maintenance. A committee of peers (measurement, curriculum, and education policy experts) selected from other states reviews the evidence and evaluates the overall quality and soundness of the assessment system. Through these external technical quality reviews, the rigorous standards for Georgia assessments are maintained.
Validity refers to the degree to which evidence and theory support the intended interpretation and use of test scores. Establishing and evaluating validity is a multi-faceted process of collecting evidence over time, starting with the initial development of a test design and continuing through clear achievement reports that support the intended interpretation and use. Considerations include alignment with content standards, creation of test and item specifications, multiple reviews by educators, careful form construction, and psychometric work to ensure equivalent and accurate results. Ultimately, the goal of an assessment program is to support a valid interpretation and use of achievement results.
Reliability in assessment refers to the degree to which test scores for a group of test takers are consistent and stable over time. In other words, a reliable assessment is one that would produce stable scores if the same group of students took the same test repeatedly without any fatigue or memory effects. Reliability may be calculated in different ways depending on the assessment design; for assessment programs such as Georgia Milestones and the GAA 2.0, it is evaluated carefully at each operational administration of the assessments.
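The sketch below illustrates one common internal-consistency reliability index, Cronbach's alpha, computed from a small students-by-items score matrix. It is offered only as an example of how such a coefficient works; the specific reliability statistics reported for Georgia assessments depend on the assessment design and may differ from this index.

```python
def cronbach_alpha(scores):
    """Cronbach's alpha for a students-by-items score matrix: an
    internal-consistency index, shown here purely as an illustration."""
    n_items = len(scores[0])

    def variance(values):
        mean = sum(values) / len(values)
        return sum((v - mean) ** 2 for v in values) / (len(values) - 1)

    item_vars = [variance([row[j] for row in scores]) for j in range(n_items)]
    total_var = variance([sum(row) for row in scores])
    return (n_items / (n_items - 1)) * (1 - sum(item_vars) / total_var)

# Tiny made-up example: four students, three dichotomously scored items.
responses = [[1, 1, 1], [1, 0, 1], [0, 0, 1], [0, 0, 0]]
print(round(cronbach_alpha(responses), 2))  # 0.75
```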
Several technical documents are available on the GaDOE website to provide information on how Georgia's assessments are developed, administered, scored, and evaluated: Technical Documents (gadoe.org)
On this page, you can find:
- Technical reports for Georgia Milestones, GAA 2.0, and ACCESS for ELLs
- U.S. Department of Education peer review results and approvals
- Technical Advisory Committee statements
- Additional briefs on specific technical topics such as alignment, cut score determination, validity, and reliability
Spotlight On: Assessment Research
In addition to standard operational research to maintain the technical quality of Georgia-grown assessments, ongoing research is conducted to explore how these assessments function and relate to one another. For example, recent research evaluated the time students spend on Georgia Milestones items and on the total test. This study sought to verify whether most students were completing the test in the expected range of time and to ensure the published "typical" ranges were accurate across demographic subgroups. The results validated the identified ranges as appropriate for Georgia students and identified the maximum as the point beyond which additional time spent on the test did not result in higher achievement. This year, after verifying the validity of the posted ranges and maximums, further research continues to explore how much time students spend on different item types and how that time relates to achievement and engagement.