How AI Scoring Works: Behind OpsTicket's Assessment Engine

OpsTicket Team
2026-04-11 | IT Assessment

OpsTicket does not grade assessments with simple answer keys. Here is how our AI engine evaluates methodology, efficiency, and technical accuracy in real time.

The Problem With "Pass/Fail" Technical Hiring

A mid-sized managed service provider posted a Linux SysAdmin role last quarter. Forty-three candidates claimed proficiency with systemd, iptables, and log analysis on their resumes. When the first two hires couldn't diagnose a failed service unit without Googling basic syntax, the hiring manager stopped trusting resumes entirely. The question was not whether to add a skills screen. The question was what kind of screen would actually predict on-the-job performance.

That scenario repeats across IT hiring every week. The gap is not between candidates who know things and candidates who don't. It is between candidates who can execute under realistic conditions and candidates who have memorized the right vocabulary. Closing that gap requires capturing what someone actually does in a terminal, not what they say they would do.

This post explains exactly how OpsTicket's assessment engine works, what it measures, and why the scoring is deterministic rather than an AI judgment call.

Capturing the Full Session, Not Just the Final Answer

When a candidate starts an OpsTicket scenario, they are dropped into a live terminal environment. From that moment, the platform records a complete session log: every command entered, every flag used, every output returned, the sequence of actions, and the elapsed time between steps. Nothing is inferred or reconstructed after the fact. The raw log is the ground truth.
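As an illustration only, a session log of this kind could be modeled as a list of timestamped records. The sketch below is in Python; the `SessionEvent` fields and the `gaps_between_steps` helper are assumptions made for this example, not OpsTicket's actual schema, and the output strings are placeholders.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SessionEvent:
    """One captured command (hypothetical schema, for illustration)."""
    started_at: float  # seconds since the session began
    command: str       # full command line as typed, including flags
    output: str        # text returned to the terminal
    exit_code: int


def gaps_between_steps(events: List[SessionEvent]) -> List[float]:
    """Elapsed seconds between consecutive commands, in time order."""
    ordered = sorted(events, key=lambda e: e.started_at)
    return [b.started_at - a.started_at for a, b in zip(ordered, ordered[1:])]


# Placeholder outputs: the real log would hold the actual terminal text.
log = [
    SessionEvent(0.0, "systemctl status nginx", "(status output)", 3),
    SessionEvent(12.5, "journalctl -u nginx -n 50", "(journal output)", 0),
    SessionEvent(40.0, "systemctl restart nginx", "", 0),
]
print(gaps_between_steps(log))
```

Because the records are immutable and ordered by timestamp, every later scoring step can be derived from the same raw data.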

This matters because the final answer is the least informative part of a troubleshooting session. Consider a scenario where a candidate needs to restore connectivity after a misconfigured firewall rule. Two candidates both restore connectivity. One ran iptables -L to inspect the current ruleset, identified the offending rule, removed it with a targeted iptables -D command, tested with ping and curl, and then persisted the change. The other ran iptables -F with no chain argument, which flushed every rule from every chain in the filter table, restoring connectivity by accident while leaving the system completely unprotected. Both would pass a pass/fail screen. Only the session log reveals which candidate you actually want on your team.

Five Scoring Dimensions, Each Measured Independently

OpsTicket evaluates every session across five dimensions. Each dimension receives an independent score on a 0-to-100 scale, and the composite score is a weighted average that hiring managers can tune by role. More importantly, recruiters see the full breakdown, not just the composite.
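A weighted average of this kind can be sketched in a few lines of Python. The dimension keys and the example weights below are assumptions chosen for illustration; the post does not publish the actual weights.

```python
DIMENSIONS = (
    "technical_accuracy", "methodology", "efficiency",
    "tool_proficiency", "time_management",
)


def composite_score(scores: dict, weights: dict) -> float:
    """Weighted average of the five 0-100 dimension scores."""
    missing = [d for d in DIMENSIONS if d not in scores or d not in weights]
    if missing:
        raise ValueError(f"missing dimensions: {missing}")
    total_weight = sum(weights[d] for d in DIMENSIONS)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(scores[d] * weights[d] for d in DIMENSIONS) / total_weight


scores = {
    "technical_accuracy": 90, "methodology": 40, "efficiency": 70,
    "tool_proficiency": 80, "time_management": 60,
}
equal = {d: 1 for d in DIMENSIONS}
senior = dict(equal, methodology=3)  # hypothetical role tuning

print(composite_score(scores, equal))   # 68.0
print(composite_score(scores, senior))  # 60.0
```

The same five dimension scores produce a different composite once methodology is weighted more heavily, which is why the breakdown is more informative than the single number.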

Technical Accuracy

Did the candidate resolve the issue correctly and completely? This dimension checks the actual system state at the end of the session against a defined correct state. For a DNS scenario, that means verifying that resolution works for the specified hostname, that the correct nameserver is configured, and that no collateral changes were introduced. Partial credit is possible: a candidate who fixes the immediate symptom but leaves a secondary misconfiguration in place scores differently than one who addresses both.
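One way to picture an end-state check with partial credit is a set of weighted assertions about the system. In the Python sketch below, the check names, point values, and nameserver address are all hypothetical, chosen to mirror the DNS example above.

```python
def technical_accuracy(final_state: dict, expected: dict, points: dict) -> float:
    """Score 0-100: the weighted share of end-state checks that match."""
    total = sum(points[key] for key in expected)
    earned = sum(
        points[key] for key in expected
        if final_state.get(key) == expected[key]
    )
    return 100.0 * earned / total


# Hypothetical checks for a DNS scenario.
expected = {
    "hostname_resolves": True,
    "nameserver": "10.0.0.53",
    "collateral_changes": False,
}
points = {"hostname_resolves": 50, "nameserver": 30, "collateral_changes": 20}

# Symptom fixed, but the wrong nameserver was left in place.
final_state = {
    "hostname_resolves": True,
    "nameserver": "8.8.8.8",
    "collateral_changes": False,
}
print(technical_accuracy(final_state, expected, points))  # 70.0
```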

Methodology

Did the candidate follow a logical diagnostic process? Methodology scoring compares the candidate's sequence of actions against expert-defined troubleshooting paths for that specific scenario. For a service failure, the expected pattern might be: check service status, read the journal log, identify the error, trace it to a configuration file, correct the value, restart the service, verify. A candidate who jumps straight to reinstalling the package without reading a single log line scores low on methodology even if the reinstall happens to work.
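A simple way to compare a candidate's sequence against an expert path is the longest common subsequence: how many expected steps appear, in the expected order. The sketch below assumes commands have already been mapped to step labels; the labels and the choice of LCS are illustrative assumptions, not a description of the engine's actual matching logic.

```python
def lcs_length(a, b) -> int:
    """Length of the longest common subsequence of two sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def methodology_score(observed_steps, expected_path) -> float:
    """Score 0-100: share of expected steps performed in the expected order."""
    if not expected_path:
        raise ValueError("expected_path must not be empty")
    return 100.0 * lcs_length(observed_steps, expected_path) / len(expected_path)


# Hypothetical step labels for a service-failure scenario.
EXPECTED_PATH = [
    "check_status", "read_journal", "identify_error",
    "edit_config", "restart_service", "verify",
]
reinstall_first = ["reinstall_package", "restart_service", "verify"]

print(methodology_score(EXPECTED_PATH, EXPECTED_PATH))
print(round(methodology_score(reinstall_first, EXPECTED_PATH), 1))
```

The reinstall-first session matches only the final two steps of the expected path, so it scores low even though the service may end up running.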

Efficiency

Did the candidate take a direct path, or did they spend time on commands unrelated to the problem? Efficiency scoring penalizes noise: running top repeatedly when CPU is not relevant to the scenario, listing directory contents that have no bearing on the fault, or issuing the same diagnostic command five times in a row. It does not penalize reasonable exploration. The rubric distinguishes between structured investigation and unfocused wandering.
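To make the distinction concrete, here is a minimal Python sketch of a noise penalty. The relevant-tool set, the penalty size, and the allowance of one free repeat are assumptions for this example; a real rubric would be tuned per scenario.

```python
def efficiency_score(commands, relevant_tools, free_repeats=1, penalty=10) -> int:
    """Start at 100 and subtract a fixed penalty for each noisy command."""
    score = 100
    seen = {}
    for cmd in commands:
        if not cmd.strip():
            continue  # ignore empty lines
        tool = cmd.split()[0]
        seen[cmd] = seen.get(cmd, 0) + 1
        if tool not in relevant_tools:
            score -= penalty  # command unrelated to the fault
        elif seen[cmd] > 1 + free_repeats:
            score -= penalty  # same command repeated beyond the allowance
    return max(score, 0)


relevant = {"systemctl", "journalctl", "cat", "vi"}
session = [
    "systemctl status nginx",
    "top",
    "top",
    "journalctl -u nginx",
    "ls /tmp",
    "systemctl status nginx",  # one re-check is allowed
]
print(efficiency_score(session, relevant))  # 70
```

Re-running a relevant diagnostic once is treated as reasonable exploration here; only unrelated commands and excess repeats cost points.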

Tool Proficiency

Did the candidate use standard utilities correctly, including appropriate flags and options? A candidate who knows that ss -tulnp shows listening sockets with process names demonstrates a different level of familiarity than one who only knows netstat with no flags. Tool proficiency scoring maps specific commands and their arguments against a proficiency rubric for the track. Helpdesk scenarios weight different tools than cloud/DevOps scenarios, and the rubrics reflect that.
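As a rough sketch of mapping commands and arguments to a rubric, the Python below parses single-letter flags and awards points in proportion to how many rubric flags were used. The `RUBRIC` table and its point values are hypothetical. The parser is deliberately simplified: it does not handle flags with attached values such as `-n50`.

```python
import shlex


def short_flags(command: str) -> set:
    """Collect single-letter flags, so '-tulnp' equals '-t -u -l -n -p'."""
    flags = set()
    for token in shlex.split(command)[1:]:
        if token.startswith("-") and not token.startswith("--"):
            flags.update(token[1:])
    return flags


# Hypothetical rubric entries for a networking track.
RUBRIC = {
    "ss": {"flags": set("tulnp"), "points": 10},
    "netstat": {"flags": set("tulnp"), "points": 10},
}


def tool_points(command: str) -> float:
    """Points earned: rubric points scaled by the share of expected flags used."""
    tokens = shlex.split(command)
    if not tokens or tokens[0] not in RUBRIC:
        return 0.0
    rule = RUBRIC[tokens[0]]
    matched = len(short_flags(command) & rule["flags"])
    return rule["points"] * matched / len(rule["flags"])


print(tool_points("ss -tulnp"))  # 10.0
print(tool_points("ss -t -l"))   # 4.0
print(tool_points("netstat"))    # 0.0
```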

Time Management

Did the candidate allocate effort appropriately across scenario components? Some scenarios have multiple fault conditions. A candidate who spends 80 percent of the session on the first issue and never reaches the second one scores differently than one who triages both and resolves them in priority order. Time management scoring is not about raw speed. It is about proportional effort relative to the complexity of each component.
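One possible way to express "proportional effort" is to compare the share of time spent on each component with that component's share of total complexity. The Python sketch below uses total variation distance for that comparison; the metric, the component names, and the complexity weights are assumptions for illustration only.

```python
def time_management_score(seconds_spent: dict, complexity: dict) -> float:
    """Score 0-100: how closely time shares track complexity shares."""
    total_time = sum(seconds_spent.get(c, 0) for c in complexity)
    total_complexity = sum(complexity.values())
    if total_time == 0:
        return 0.0
    # Total variation distance between the two distributions, in [0, 1].
    distance = 0.5 * sum(
        abs(seconds_spent.get(c, 0) / total_time - complexity[c] / total_complexity)
        for c in complexity
    )
    return 100.0 * (1.0 - distance)


# Two hypothetical faults of equal complexity.
complexity = {"dns_fault": 1, "firewall_fault": 1}
lopsided = {"dns_fault": 720, "firewall_fault": 180}  # 80 percent on the first
balanced = {"dns_fault": 450, "firewall_fault": 450}

print(round(time_management_score(lopsided, complexity), 1))
print(time_management_score(balanced, complexity))
```

Note that the score depends only on proportions, so a slower candidate with the same split receives the same result, consistent with the point that this dimension is not about raw speed.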

How the Rubric Engine Identifies Patterns and Anti-Patterns

The scoring engine does not use a language model to read the session and form an opinion. It applies a deterministic rubric: a structured set of rules that map specific observable behaviors to specific score adjustments. Every rule was written by a subject-matter expert for that track, reviewed against professional best practices, and tested against sessions from practitioners with verified experience.

For a given scenario, the rubric defines positive indicators (commands and sequences that demonstrate competence) and anti-patterns (behaviors that indicate risk or poor practice). Anti-patterns reduce methodology and efficiency scores even when the candidate reaches the correct final state. Common anti-patterns across tracks include:

  • Running commands with elevated privileges when standard user permissions are sufficient
  • Editing configuration files without first creating a backup
  • Restarting a service before reading its logs to understand why it failed
  • Making changes without verifying the baseline state first
  • Using chmod 777 as a troubleshooting step for permission errors
  • Disabling a firewall entirely rather than adding a targeted rule

These behaviors matter in production. A candidate who habitually skips backups before editing configs is a liability regardless of their accuracy score. The rubric captures that signal explicitly.
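Several of the anti-patterns listed above can be expressed as simple ordered rules over the command sequence. The Python sketch below is a hedged illustration: the rule names and regular expressions are assumptions for this example, and they cover only a few obvious forms of each behavior.

```python
import re


def find_anti_patterns(commands):
    """Return (index, rule_name) for each anti-pattern observed, in order."""
    hits = []
    read_logs = False
    backed_up = False
    for i, cmd in enumerate(commands):
        if re.search(r"\bjournalctl\b|/var/log/", cmd):
            read_logs = True
        if re.match(r"\s*(sudo\s+)?cp\b", cmd):
            backed_up = True  # simplification: any copy counts as a backup
        if re.search(r"\bchmod\s+(-R\s+)?777\b", cmd):
            hits.append((i, "chmod_777"))
        if re.search(r"\biptables\s+(-F|--flush)\b", cmd):
            hits.append((i, "firewall_flushed"))
        if re.search(r"\bsystemctl\s+restart\b", cmd) and not read_logs:
            hits.append((i, "restart_before_logs"))
        if re.match(r"\s*(sudo\s+)?(vi|vim|nano)\b", cmd) and not backed_up:
            hits.append((i, "edit_without_backup"))
    return hits


risky = [
    "systemctl restart nginx",
    "journalctl -u nginx",
    "sudo vi /etc/nginx/nginx.conf",
    "chmod 777 /var/www",
]
careful = [
    "journalctl -u nginx",
    "cp /etc/nginx/nginx.conf /etc/nginx/nginx.conf.bak",
    "vi /etc/nginx/nginx.conf",
    "systemctl restart nginx",
]
print(find_anti_patterns(risky))
print(find_anti_patterns(careful))
```

Each hit carries the index of the command that triggered it, so a reviewer can trace a deduction back to a specific line in the session log.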

Why Deterministic Scoring Matters for Hiring Decisions

The word "AI" in hiring contexts has become associated with opaque, unexplainable verdicts. OpsTicket's engine is not that. Every score can be traced back to specific rules applied to specific observed behaviors. A recruiter can look at a candidate's methodology score of 42 and see exactly which expected steps were missing and which anti-patterns were triggered. There is no black box, no model confidence interval, no probabilistic judgment. The same session run through the engine twice produces the same score.

This has practical consequences for hiring teams. When a candidate disputes a score, the session log and the applied rules are both available for review. When a hiring manager wants to weight efficiency more heavily for a senior role, they can adjust the composite weighting and see how the candidate pool re-ranks. When a team wants to set a minimum methodology score for candidates who will work in regulated environments, they can do that without rerunning assessments.
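Because dimension scores are stored per candidate, re-weighting and threshold filtering only need arithmetic on existing results. The Python sketch below assumes anonymized candidate IDs and made-up scores; the `rerank` function and its parameters are hypothetical.

```python
def rerank(candidates: dict, weights: dict, minimums: dict = None):
    """Rank candidates by weighted composite, dropping any below a minimum."""
    minimums = minimums or {}
    total = sum(weights.values())
    ranked = []
    for name, scores in candidates.items():
        if any(scores[d] < floor for d, floor in minimums.items()):
            continue  # fails a per-dimension minimum
        composite = sum(scores[d] * w for d, w in weights.items()) / total
        ranked.append((name, round(composite, 1)))
    # Highest composite first; ties broken by ID for a stable order.
    return sorted(ranked, key=lambda item: (-item[1], item[0]))


candidates = {
    "cand_a": {"technical_accuracy": 90, "methodology": 42, "efficiency": 85,
               "tool_proficiency": 80, "time_management": 75},
    "cand_b": {"technical_accuracy": 80, "methodology": 78, "efficiency": 60,
               "tool_proficiency": 70, "time_management": 70},
}
equal = {d: 1 for d in candidates["cand_a"]}
method_heavy = dict(equal, methodology=3)

print(rerank(candidates, equal))
print(rerank(candidates, method_heavy))
print(rerank(candidates, equal, minimums={"methodology": 50}))
```

With equal weights the first candidate ranks higher; weighting methodology more heavily reverses the order, and a methodology floor of 50 removes the first candidate entirely. No session is re-scored in any of the three cases.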

Reducing Bias Through Behavioral Evidence

Resume screening correlates with school prestige, job title inflation, and writing quality. Unstructured interviews correlate with confidence, communication style, and how comfortable a candidate makes the interviewer feel. None of those signals is what you are trying to measure when you need someone to diagnose a network fault at 2 a.m.

OpsTicket scoring is blind to candidate identity. The engine receives a session log and applies a rubric. It has no access to the candidate's name, school, years of claimed experience, or demographic information. The score reflects what the candidate did in the terminal, nothing else. IT Custom Solution, the company behind OpsTicket, audits scoring rubrics regularly for disparate impact to ensure that no dimension inadvertently proxies for demographic characteristics rather than technical skill.

What the Score Report Shows

Candidates receive a score report within seconds of completing a scenario. The report shows the composite score, the five dimension scores, a summary of strengths, and specific areas where the session deviated from the expert rubric. Candidates can share the report with employers via a unique link, and hiring managers can verify the score independently at tryopsticket.com.

The report also includes targeted guidance based on identified gaps. A candidate who scored low on tool proficiency for a networking scenario sees specific utilities they should practice. A candidate who scored low on methodology sees the expected diagnostic sequence they missed. This makes the assessment useful for candidates regardless of the hiring outcome.

OpsTicket covers six tracks: helpdesk, networking, cybersecurity, cloud/DevOps, Linux SysAdmin, and AI foundations. Pro tier access is $49 per month. Full pricing details are at tryopsticket.com/pricing.

The Practical Takeaway

If your hiring process produces candidates who look identical on paper and then perform very differently on the job, the problem is the signal you are collecting. A session log scored against a deterministic rubric gives you five independent behavioral signals from a realistic work sample. That is more predictive than a resume, more consistent than an interview, and more defensible than either. The score tells you what the candidate did. You decide what that means for the role.

Ready to prove it?

One scenario, ~15 minutes, free for candidates. Walk away with a verified score.