National AI Toolkit for Citizen Services

(Draft plan for Government, Academia, Industry and Citizen Collaboration)

 

 

Date: 23rd September 2025

 

Initial Draft Authors

Jerry Sweeney

Managing Director, CloudCIX Limited

https://www.cix.ie

https://www.cloudcix.com

 

Tung Tran

Ph.D. Researcher, University College Cork

 

Joseph McInerney

Research Assistant, Trinity College Dublin

 

Harry Nguyen

Associate Professor/Director (MSc CS), University College Cork

 

 

Contents

Table of figures

Acronyms

Web References

Executive Summary

1        Introduction and Background

1.1            Introduction

1.2            Language support

2        Vision & Objectives

2.1            Vision

2.2            Scope

2.2.1        In Scope

2.2.2        Out of Scope

2.3            End Use Applications

2.4            Objectives & Benefits Summary

3. Governance & Ethical Issues

3.1            Legislative Requirements

3.1.1        EU Artificial Intelligence (AI) Act

3.1.2        General Data Protection Regulation (GDPR 2016)

3.1.3        Data Governance Act (DGA 2022)

3.1.4        European Health Data Space Regulation (EHDS 2025)

4. Work Schedule

4.1            Challenges

4.1.1        Technical

4.1.2        Environmental, Economic & Infrastructural

4.1.3        Social & Ethical

4.1.4        Legal & Policy

4.1.5        Practical Deployment Challenges

4.2            Dataset Collection

4.3            Model Selection & Finetuning

4.5            Roadmap & Timeline

5. Conclusion

Acknowledgement

Appendix 1: Publications and achievements to date

A1.1         Tung Tran Ph.D. candidate, UCC

A1.2         Joseph McInerney Research Assistant, Trinity College Dublin

A1.3         McInerney & Tran: 100 years of Irish in the Oireachtas

Appendix 2: Simple Example Applications

A2.1         Custom Chatbot for a Website

A2.2         Central Statistics Office

 

 

 

Table of figures

Figure 1  AI Customer Services Applications Software Stack

Figure 2  Proposed Application Use Cases

Figure 3  Application Architecture (NVIDIA 2025)

Figure 4  Objectives & Benefits

Figure 5  Proposed steps to train foundation bilingual Irish-English language model

Figure 6  Timeline, first six months

 

Acronyms

AI         Artificial Intelligence  /  Ard Intellacht

AIAI      Artificial Intelligence Association of Ireland (https://aiai.ucd.ie/)

API       Application Programming Interface

CSO      Central Statistics Office

EI         Enterprise Ireland

EHDS    European Health Data Space regulation (EHDS 2025)

EN        English  /  Béarla

GA        Irish  /  Gaeilge

GDPR    General Data Protection Regulation

ICT       Information and Communications Technology

IDA       Industrial Development Authority

IPS       Identity Provider Service

LLM      Large Language Model

MCP     Model Context Protocol

ML       Machine Learning

NIS2     Network and Information Security Version 2 (EU Directive 2022/2555)

RAG      Retrieval Augmented Generation

SLM      Small Language Model (for this document defined as < 32B 16-bit parameters)

UCC      University College Cork

 

 

 

Web References

Name | Link | Notes
Artificial Intelligence for Ireland (AIFI) | https://aiforireland.ie/ | AI for Ireland's website
Artificial Intelligence Association of Ireland (AIAI) | https://aiai.ucd.ie/ | Representative Association
CeADAR | https://ceadar.ie/ | CeADAR is Ireland’s national centre for AI, funded by Enterprise Ireland and IDA.
Connecting Government 2030 | https://assets.gov.ie/static/documents/connecting-government-2030-a-digital-and-ict-strategy-for-irelands-public-service-e7a3.pdf | A Digital and ICT Strategy for Ireland’s Public Service
EU Artificial Intelligence (AI) Act | https://enterprise.gov.ie/en/what-we-do/innovation-research-development/artificial-intelligence/eu-ai-act/
National AI Strategy | https://assets.gov.ie/static/documents/national-ai-strategy-refresh-2024.pdf | 2024 Refresh
Official Languages Act 2003, Amended 2021 | https://www.gov.ie/en/department-of-rural-and-community-development-and-the-gaeltacht/publications/official-languages-act/#copy-of-legislation
Small Language Models are the Future of Agentic AI | https://arxiv.org/pdf/2506.02153 | NVIDIA Research, 2nd June 2025
Research Ireland Centre for Research Training in Artificial Intelligence | https://www.crt-ai.ie
Insight Research Ireland Centre for Data Analytics, UCC | https://www.insight-centre.org
ENTIRE EDIH Ireland | https://entire-edih.ie

 

 

Executive Summary

University College Cork (UCC) and CloudCIX Limited want to revolutionise the efficiency of citizen/customer service for the Irish Government and businesses using ethical, sovereign, agentic, bilingual AI. This document is to inform potential stakeholders and gather support for the project.

UCC and CloudCIX have been working jointly since January 2024 to develop a Machine Learning (ML) toolkit designed to assist in the development of customer service applications for government and private companies. Postgraduate students from other colleges have also contributed data sets. UCC and CloudCIX wish to set up a community to manage the further development and commercialisation of the outputs of this research.

·       The community will be open to all students, academics, and professionals working in Ireland and interested in the broad field of Artificial Intelligence.

·       Government agencies such as Enterprise Ireland (EI), CeADAR, and Údarás na Gaeltachta will be invited to oversee its operations.

·       An organisation will be selected, or a new one formed, to manage the new, larger community group. One possibility is that a subcommittee of the Artificial Intelligence Association of Ireland (AIAI) could take on this role. Participants in the project can join this group to decide strategy and the toolkit roadmap.

·       The community will investigate EU-based Large Language Models (LLMs) to select or develop one or more Small Language Models (SLMs, under 30B parameters for ease of deployment) suitable for training for different applications. Data sets can be contributed by the community in return for input into the development of the base LLM(s)/SLM(s).

·       The community will define the national toolkit landscape for government, companies and citizens, including localised question-answering, translation tools, document retrieval and summarisation.

By concentrating on customer service applications in service delivery and using SLMs finetuned to Irish culture and data retrieval systems, concerns about hallucinations and unethical/discriminatory responses can be eliminated.

1     Introduction and Background

1.1    Introduction

“AI holds great promise for the delivery of better Public Services, and there is a strong ambition across the public service to harness trustworthy AI for this purpose, as part of the digital transformation of the Public Service.”  (National AI Strategy, 2024 Refresh)

 

“We (Irish Civil Service) will deliver our public services taking a digital by default approach through collaboration with our stakeholders and the public, building towards the target for 2030 set by the Civil Service Renewal strategy of ensuring that 90% of applicable services are consumed online.  In delivering a human-driven experience, we will take a universal design approach, so that the services are available on a 24-7 basis, on the device of choice and are delivered in an equitable, inclusive and sustainable manner.” (Connecting Government 2030)

 

 

In alignment with the National Strategies quoted above, this document is a call to form a ‘coalition of the willing’ to establish a sovereign, culturally relevant AI ecosystem that empowers citizens, government agencies, and enterprises. The project’s cornerstone is the development of a set of models and tools, called the ‘AI Toolkit’, that can later be used to build secure, language-inclusive, and citizen-focused agentic services. The project is called AI For Ireland (AIFI).

University College Cork and CloudCIX Limited began joint work in January 2024 to develop a bilingual EN/GA LLM and other AI/ML tools and models, for use in Ireland by both government and the private sector, to build safe agentic applications and to ensure data sovereignty.

·       UCC and CloudCIX team

·       GPU and compute resources have been donated to the project by CloudCIX Limited.

·       A list of research achievements to date is included in Appendix 1

·       The current state of the AI toolkit can be seen by visiting

o   https://www.cloudcix.com/ml.html

UCC and CloudCIX now wish to extend the resources deployed to support this initiative by working with other interested parties on this common goal, hence the publication of this document.

 

1.2    Language support

English and Irish (Gaeilge) are the official languages of the Republic of Ireland and Northern Ireland.

The Constitution of the Republic of Ireland permits the public to conduct its business – and every part of its business – with the State solely through Irish. As a result, public bodies have a duty to comply with this right (Official Languages Act 2003, Amended 2021).

For these reasons, tools developed under this project must be designed for the provision of bilingual customer service solutions.

2     Vision & Objectives

2.1    Vision

“Artificial Intelligence for Ireland - Government, Academia, Industry and Citizen”

Figure 1 represents an entire software stack that can be used to build customer services applications. This document is a request for cooperation in developing layer 3, the National Toolkit, and the associated training data sets.

 

Figure 1  AI Customer Services Applications Software Stack

Examples of ML models contained in the AI Toolkit.

·       Bilingual LLM(s) finetuned with culturally relevant datasets

·       Embedding Models for vector database applications

·       EN/GA and GA/EN Translation Models

Examples of non-ML tools contained in the Toolkit.

·       MyGovID Interaction API

·       API/MCP integration tools

 

 

 

2.2    Scope

2.2.1   In Scope

The scope of the proposed project is to build a toolkit, regulated by the Irish Government and/or open source, that can be used to develop agentic customer service applications.

The key component of the Toolkit will be one or more SLMs that will be fine-tuned for EN/GA agentic applications. These SLMs must be optimised for use with data returned from retrieval technologies such as APIs, MCP, JSON-stat and vector databases.

To complement the SLMs, the Toolkit will include models for tasks such as creating and retrieving embeddings and performing EN/GA and GA/EN translation, as well as non-ML tools that interact with document retrieval systems and Identity Provider Services (IPS) such as MyGovID. The current version of the toolkit is hosted at this URL[1].
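To illustrate what "data returned from retrieval technologies" can look like in practice, the following is a minimal sketch of reading a single value out of a JSON-stat 2.0 dataset, the kind of structured response a tool would hand to an SLM. The cube, its dimension names and its figures are invented for illustration, and the helper names `category_position` and `lookup` are hypothetical rather than part of the Toolkit.

```python
from typing import Any, Dict


def category_position(dimension: Dict[str, Any], category_id: str) -> int:
    """Return the position of a category within one JSON-stat dimension."""
    index = dimension["category"]["index"]
    # JSON-stat allows "index" to be either a list of ids or an id -> position map.
    if isinstance(index, list):
        return index.index(category_id)
    return index[category_id]


def lookup(dataset: Dict[str, Any], selection: Dict[str, str]) -> Any:
    """Fetch one cell from a JSON-stat 2.0 dataset given one category per dimension."""
    offset = 0
    for dim_id, size in zip(dataset["id"], dataset["size"]):
        position = category_position(dataset["dimension"][dim_id], selection[dim_id])
        # Values are stored row-major: the last dimension varies fastest.
        offset = offset * size + position
    return dataset["value"][offset]


# Hypothetical two-dimensional cube; the figures are invented for illustration.
example = {
    "version": "2.0",
    "class": "dataset",
    "id": ["year", "sex"],
    "size": [2, 2],
    "dimension": {
        "year": {"category": {"index": ["2001", "2002"]}},
        "sex": {"category": {"index": {"male": 0, "female": 1}}},
    },
    "value": [10.0, 11.0, 12.0, 13.0],
}

print(lookup(example, {"year": "2002", "sex": "male"}))  # 12.0
```

Because the lookup is deterministic, the SLM only has to map the user's question onto dimension categories; the figure itself always comes from the source data.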

2.2.2   Out of Scope

Other than to be used as proof of concepts, this project will not develop ‘end use customer service applications’. It is expected that Government agencies will develop specifications for such applications and go to tender to have them developed. Meeting the requirements of the tendering process will hopefully be easier (and hence less expensive) because of the freely available Toolkit.

·       Section 2.3 explains what types of end use applications are under consideration.

·       Section 4.1 describes the challenges of such applications and how these challenges can be addressed.

·       Appendix 2 contains several prototype applications for illustrative purposes.

 

 

2.3    End Use Applications

An initial analysis of potential applications suggests that there are four possible application types and use cases that the National AI toolkit might be used to address. These application types will only give rise to minimal risks and consequently, can be marketed and used subject to the existing legislation without additional legal obligations under the AI Act. Responsibility for GDPR compliance will rest with the developers of the Application rather than with the developers of the National AI Toolkit.

Application Type | Example Use Cases
Unstructured public data distribution. No login required. | Accessing public planning application documents.
Structured public data distribution. No login required. | Accessing CSO public data.
Structured private data distribution. Login required. | Using MyGovID or other IPS to access personal medical records.
Transactional private data processing. Login required. | Using MyGovID to order a driver’s licence. Using MyGovID to order a passport.

Figure 2  Proposed Application Use Cases

It is important to note that all these proposed use cases are data retrieval systems. It is better not to think of them as Retrieval Augmented Generation (RAG) systems because they are in effect Generation Augmented Retrieval systems. The SLM(s) will be used to interpret the request from the end user. The System/User prompts will be engineered to prevent the application from answering any question where the answer is not contained in the underlying external data. This architecture eliminates concerns about hallucinations and unethical/discriminatory responses.
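The gating idea described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the Toolkit's implementation: word overlap stands in for a real embedding model, and the threshold value, the refusal wording, the sample passage and the function names are all hypothetical.

```python
import re
from typing import List, Optional, Tuple

REFUSAL = "I can only answer questions covered by the published data."


def tokens(text: str) -> set:
    """Lower-case word set; a crude stand-in for a real embedding model."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def best_passage(question: str, passages: List[str]) -> Tuple[Optional[str], float]:
    """Return the passage with the highest Jaccard overlap and its score."""
    q = tokens(question)
    best, best_score = None, 0.0
    for passage in passages:
        p = tokens(passage)
        union = q | p
        score = len(q & p) / len(union) if union else 0.0
        if score > best_score:
            best, best_score = passage, score
    return best, best_score


def build_prompt(question: str, passages: List[str], threshold: float = 0.2) -> Optional[str]:
    """Build a grounded prompt, or return None when nothing relevant was retrieved."""
    passage, score = best_passage(question, passages)
    if passage is None or score < threshold:
        return None  # the caller replies with REFUSAL; the SLM is never invoked
    return (
        "Answer using only the context below. "
        f"If the answer is not in the context, reply: {REFUSAL}\n"
        f"Context: {passage}\n"
        f"Question: {question}"
    )


# Invented sample passage, for illustration only.
corpus = ["Planning application 24/123 was granted on 3 March 2025."]
on_topic = build_prompt("When was planning application 24/123 granted?", corpus)
off_topic = build_prompt("Who won the match last night?", corpus)
```

Here `on_topic` is a prompt that carries the retrieved passage, while `off_topic` is `None`, so the application would return the refusal text without calling the model at all.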

Figure 3 contains a schematic from the NVIDIA Research 2025 paper “Small Language Models are the Future of Agentic AI” that deals with the architecture being proposed.

 

Figure 3  Application Architecture (NVIDIA 2025)

 

2.4    Objectives & Benefits Summary

Strand | Objectives | Benefits
Cooperation | Encourage multi-stakeholder collaboration between government, academia, and private industry. | Published open research, tools, problem statements, and agendas for advancing bilingual (EN/GA) language technologies. Advancement of Ireland as a centre for AI/ML research and development.
Publication & Authorship | Publish papers and open-source data sets and tools. | Academic institutions can align their research with this common objective.
Language processing skills | Build human knowledge and expertise in Irish for data annotations, validations, and evaluations, especially in language processing applications. | Allow government agencies to meet their statutory requirement to offer services bilingually. Increased availability/use of GA will help preserve and grow the language.
Data sets | EN, GA and GA/EN parallel data contributions: pre-collected/processed Irish corpora, curated data (e.g., transcripts, debates, TG4, etc.), annotations in various formats (e.g., questions and answers). | A growing corpus of curated data, rich in Ireland’s culture and free of undesirable data, to be used to fine-tune SLMs and LLMs.
Tools | Fine-tune one or more LLMs or SLMs to act as a base for agentic customer service applications. | Cost-effective citizen service.
Training Infrastructure | To provide HPC infrastructure for research. | CloudCIX Limited has committed to €200,000 of compute credits per annum on the Boole Supercomputer toward development of the Model Zoo.
Inference Infrastructure | Establish sovereign deployment infrastructure, creating a foundation for trustworthy, ethical, and transparent AI deployment at scale in Ireland. | Ensure agentic AI applications can be deployed in Ireland with confirmed data sovereignty, allowing control over sensitive citizen information.

Figure 4  Objectives & Benefits

 

 

3. Governance & Ethical Issues

3.1    Legislative Requirements

The project must be aware of and aligned with a number of Irish/EU legislative requirements.

3.1.1   EU Artificial Intelligence (AI) Act

The EU Artificial Intelligence (AI) Act is an EU regulation which entered into force on 2 August 2024 and is directly applicable across the EU.

The EU AI Act establishes a risk-based regulatory framework for artificial intelligence within the European Union. It classifies AI systems into four categories (unacceptable risk, high risk, limited risk, and minimal risk), with stricter obligations for higher-risk applications.

The Toolkit and underlying applications must align with this legislation.

3.1.2   General Data Protection Regulation (GDPR 2016)

This will be an issue for the applications developed using the Toolkit.

3.1.3   Data Governance Act (DGA 2022)

This legislation covers data under a public authority's control, accessed remotely.

3.1.4   European Health Data Space Regulation (EHDS 2025)

For medical record information this may be relevant for applications.

 

 

 

 

 

4. Work Schedule

4.1    Challenges

The work of the proposed group will be to address many challenges. This section lists these challenges with some comments. Addressing these challenges will be among the key objectives of the group.

4.1.1   Technical

The Toolkit must deliver on several technical fronts to allow subsequently developed applications to be effective.

1.     Build bilingual data sets required to meet challenge 2.

2.     Locate base SLMs and fine tune them to:

o   Work bilingually in EN/GA.

o   Be cognisant of Irish cultural norms, understanding questions and preparing answers in that context.

o   Interface with data sources using MCP, JSON-stat, APIs.

o   Interface with unstructured data by means of embedding databases.

o   Prevent users from escaping constraints applied by system and user prompts.

3.     Build tools that perform tasks such as translation and document embedding.

4.     Build tools that perform speech to text and text to speech, bilingually.

4.1.2   Environmental, Economic & Infrastructural

AI can be power hungry and therefore expensive. It is vital to have strategies that mitigate environmental and economic costs. Using pre-trained base models and fine-tuning them, rather than training an SLM from scratch, is an example of such a strategy. An objective for all the tools developed for the Toolkit is that they can run on minimal hardware. For example, the Mistral 24B SLM can be run on a single NVIDIA H100 GPU at a power draw of 1 kW. The same SLM can handle queries from multiple applications, allowing scaling to occur as required without committing resources that are not utilised.
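The hardware claim above can be checked with simple arithmetic. The sketch below assumes 16-bit weights (2 bytes per parameter, in line with the SLM definition in the Acronyms list) and a constant 1 kW draw all year; it counts the weights only and ignores the extra memory needed for activations and caching. The 80 GB capacity mentioned afterwards is an assumption about the common H100 configuration and does not come from this document.

```python
def weights_gb(parameters: float, bytes_per_parameter: int = 2) -> float:
    """Memory needed for the model weights alone, in gigabytes (10^9 bytes)."""
    return parameters * bytes_per_parameter / 1e9


def annual_energy_kwh(power_kw: float, hours_per_year: int = 24 * 365) -> float:
    """Energy used by a device drawing constant power all year."""
    return power_kw * hours_per_year


print(weights_gb(24e9))        # 48.0 GB of weights at 16-bit precision
print(annual_energy_kwh(1.0))  # 8760.0 kWh per year at a constant 1 kW
```

On these assumptions a 24B model needs about 48 GB for its weights, which leaves headroom on a GPU with 80 GB of memory, and one continuously running GPU uses roughly 8,760 kWh a year.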

4.1.3   Social & Ethical

While AI provides many opportunities there are social risks and ethical dilemmas that need to be addressed. However, this project will be focused on customer service applications and by virtue of this narrow focus will avoid many of these challenges.

4.1.4   Legal & Policy

While the project needs to be compliant with Irish and EU legislation including the EU AI Act and GDPR, the narrow focus of the application set being considered limits this exposure.

4.1.5   Practical Deployment Challenges

Ireland must host both the Toolkit and subsequent applications within sovereign data centres.

4.2    Dataset Collection

Datasets of various types are required, including factual information, question–answer pairs, instructions and responses, retrieval-augmented generation (RAG) samples, and tool/function-calling examples. All datasets must be of high quality, truthful, and safe.

A strong focus on Irish-language resources is essential, complemented by carefully filtered English data to ensure cultural and value alignment with Irish society. At present, the available Irish-language data is estimated at under 1 billion tokens, whereas pre-training billion-parameter-scale language models typically requires trillions of tokens. Addressing this data scarcity will be a core challenge.

While textual data will remain the most important modality, multimodal data (e.g., text–speech, text–image, and conversational interaction datasets) can also play a role, especially in citizen service contexts where accessibility and inclusivity are priorities.

All datasets must be:

·       Open-source where possible, to ensure transparency and public trust.

·       Legally compliant, respecting copyright, GDPR, and data protection requirements.

·       Well documented, to enable reproducibility and accountability.
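One way to make the requirements above checkable is to attach provenance metadata to every sample and validate it on ingestion. The following is a minimal sketch only: the field names, the list of kinds and the `problems` helper are hypothetical and are not an agreed schema for the project.

```python
from dataclasses import dataclass
from typing import List

ALLOWED_LANGUAGES = {"en", "ga"}
ALLOWED_KINDS = {"factual", "qa", "instruction", "rag", "tool_call"}


@dataclass
class DatasetRecord:
    kind: str       # which dataset type this sample belongs to
    language: str   # "en" or "ga"
    text: str       # the sample itself
    source: str     # where the text came from, for reproducibility
    licence: str    # licence under which the text may be redistributed


def problems(record: DatasetRecord) -> List[str]:
    """Return a list of human-readable problems; empty means the record passes."""
    found = []
    if record.kind not in ALLOWED_KINDS:
        found.append(f"unknown kind: {record.kind}")
    if record.language not in ALLOWED_LANGUAGES:
        found.append(f"unsupported language: {record.language}")
    if not record.text.strip():
        found.append("empty text")
    if not record.source.strip():
        found.append("missing source")
    if not record.licence.strip():
        found.append("missing licence")
    return found


# Illustrative records; the source name is a placeholder.
good = DatasetRecord("qa", "ga", "Cad is ainm duit?", "hypothetical-source", "CC-BY-4.0")
bad = DatasetRecord("qa", "fr", "", "", "")
print(problems(good))  # []
print(problems(bad))
```

Checks of this kind cover documentation and licensing metadata only; whether a sample is truthful, safe or GDPR-compliant still needs human and legal review.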

In parallel with training data, a comprehensive evaluation dataset must be developed to benchmark model performance. Evaluation data should be representative of real-world citizen service needs, such as customer service dialogues, frequently asked questions, and government support scenarios.

Existing Irish resources provide a starting point:

·       UCCIX corpus: A collection of raw text data from multiple corpora, filtered for Irish language.

·       IrishQA: A benchmark for question–answering tasks in Irish.

·       IrishMultiJail: A dataset for testing safety alignment and culturally sensitive refusal behaviour in both Irish and English.

·       IRLBench: A benchmark suite for evaluating bilingual performance using the Leaving Certificate Examination.

These resources should be systematically inspected, cleaned, extended, and maintained. Data creation will be required through partnerships with public sector agencies, and collaboration with academia and industry.

 

 

4.3    Model Selection & Finetuning

It is proposed to work with EU-based LLMs (e.g., Mistral) to develop one or more SLMs that can be used as base models for fine tuning. The goal is to produce bilingual, culturally grounded, and safe language models and tools optimized for citizen service applications in Ireland.

Several factors need to be considered in model selection and fine-tuning:

1.     Data Utilization

-              Designing effective pre-training and fine-tuning pipelines to maximize useful information from scarce Irish-language data.

-              Exploring data augmentation, synthetic data generation, and cross-lingual transfer learning to scale Irish resources without compromising quality.

2.     Bilingual & Culturally Sensitive Adaptation

-              Enhancing models to seamlessly switch between Irish and English, supporting code-switching where appropriate.

-              Embedding cultural context, local norms, and values into the training process to ensure outputs resonate with Irish citizens.

-              Leveraging Irish-specific evaluation benchmarks (e.g., IRLBench) to ensure bilingual robustness.

3.     Safeguarding

-              Incorporating safety alignment mechanisms, such as instruction filtering, red-teaming in both languages, and agent-in-the-loop evaluation with datasets such as IrishMultiJail.

-              Ensuring compliance with EU AI Act principles around transparency, robustness, and accountability.

-              Applying continual monitoring and auditing pipelines to track performance drift and prevent misuse.

4.     Survey of Literature & Best Practices

-              Reviewing state-of-the-art bilingual and low-resource NLP techniques (e.g., parameter-efficient fine-tuning, multilingual transfer, and adapters).

-              Learning from similar national AI efforts (e.g., Swiss AI Initiative) to avoid duplication and adapt successful strategies.
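To give a sense of why parameter-efficient fine-tuning suits scarce data and modest hardware, the sketch below counts trainable parameters for low-rank adaptation (LoRA), one widely used technique of this kind. LoRA is named here only as an illustration and is not a method the project has committed to; the matrix size and rank are assumed values.

```python
def full_parameters(d: int, k: int) -> int:
    """Trainable parameters when a d x k weight matrix is updated directly."""
    return d * k


def lora_parameters(d: int, k: int, rank: int) -> int:
    """Trainable parameters for a low-rank update B (d x rank) times A (rank x k)."""
    return rank * (d + k)


d = k = 4096  # assumed size of one projection matrix
rank = 16     # assumed adapter rank

full = full_parameters(d, k)
lora = lora_parameters(d, k, rank)
print(full, lora, lora / full)  # 16777216 131072 0.0078125
```

For this one matrix the adapter trains under 1% of the parameters that full fine-tuning would, which reduces both the memory needed for training and the amount of data the update can overfit to.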

Figure 5  Proposed steps to train foundation bilingual Irish-English language model

 

4.5    Roadmap & Timeline

Date | Milestone | Notes
Sept 1st 2025 | Create a draft plan. | Initial circulation to Údarás na Gaeltachta and AIAI to get initial support.
Sept 18th 2025 | Publicly announce the community formation and arrange the first meeting of interested parties. | Decide whether this can be part of an existing organisation (preferred) or set up a new organisation to manage the community. Announce to coincide with National AI Meet on Thursday 18th Sept in Galway.
October 2025 | Review existing landscape. | Access feedback from interested parties. Perhaps a series of virtual presentations by existing practitioners.
October 2025 | Publicise the ‘path to success vision’. | Address concerns around ethics, confidentiality and hallucination.
November 2025 | Decide on priorities and assign groups to each priority.
November 2025 | Publish a website outlining the work being undertaken. | A content platform like Sphinx or a Wiki would allow collaboration on content. We want to avoid the need for a centralised editor.
1st / 2nd December 2025 | Announce the launch of the community. | Use the AIAI conference as the initiative launch event.
January 2026 | Engage with government and semi-state organisations to gather needs and prepare POCs. | Use the information gathered to update this document.
February 2026 | Form a committee to define strategy and manage deliverables.

Figure 6   Timeline, first six months.

5. Conclusion

Ireland needs ethical, sovereign, agentic, bilingual artificial intelligence to meet the needs of its citizens and businesses. The proposed community will facilitate discussion and standardisation, as well as accelerate the development of AI frameworks and toolkits. Everyone working in Ireland and interested in the broad field of Artificial Intelligence should be included in this national initiative, fostering long-term and sustainable benefits for the nation, including huge job opportunities, new revenue streams and technological breakthroughs.

Acknowledgement

Tung Tran and Harry Nguyen are supported by Taighde Éireann–Research Ireland under Grants 18/CRT/6223 and 12/RC/2289-P2, which are co-funded under the European Regional Development Fund.

Appendix 1: Publications and achievements to date.

A1.1  Tung Tran Ph.D. candidate, UCC

[RESEARCH PAPER, MODEL & DATA] UCCIX & UCCIX‑2 – ECAI2024

[RESEARCH PAPER & MODEL] UCCIX-Translate – ACL2025 Workshop on Technologies for Machine Translation of Low-Resource Languages

[RESEARCH PAPER  & MODEL] UCCIX-Reasoning – Under submission at AAAI2026

[RESEARCH PAPER  & DATA] IRLBench – Under submission at KDD2026 – Datasets and Benchmark Track

[RESEARCH PAPER] Multi-Agent Collaboration Mechanisms: A Survey of LLMs – Under submission at ACM Computing Surveys

[RESEARCH PAPER] Disentangling Language Understanding and Reasoning Structures in Cross-lingual Chain-of-Thought Prompting – EMNLP2025

[APPLICATION] CloudCIX RAG Workbench

[APPLICATION] CloudCIX Guiden - Agentic Chatbot

Community Engagement

A1.2 Joseph McInerney Research Assistant, Trinity College Dublin

·       MSc Artificial Intelligence, Queen’s University Belfast.

·       Currently enrolled in C1 Gaelchultúr Irish language course.

[RESEARCH PAPER] Qomhrá: A Bilingual Irish/English Large Language Model – Intention to Submit to LREC2026  

Visual Overview: 

 

 

A1.3  McInerney & Tran: 100 years of Irish in the Oireachtas

[DATA] The Dáil Data

 

 

 

Analysis

This work presents an analysis of the Oireachtas corpus covering more than a century of Irish parliamentary proceedings, spanning debates, summaries, reports, and questions and answers. The dataset is bilingual, containing both Irish and English, and is estimated to comprise between 700M and 1B tokens (words).

The figures demonstrate initial analyses already carried out. The first bar chart tracks the total number of tokens by year, showing steady growth in parliamentary text over time, with a sharp increase from the early 2000s onward. The second chart measures the proportion of Irish tokens by year, highlighting fluctuations in the presence of Irish. The third set of pie charts presents topic distributions, comparing overall, English-only, and Irish-only usage. While English debates emphasize Health, Education, and Finance, Irish ones mainly focus on the development of the Irish language itself (Community, Rural and Gaeltacht Affairs, Arts, Tourism, and Culture).

Additional analyses are planned as future work, including event detection, code-switching patterns, sentiment analysis across topics and time, duplication checks against other Irish corpora, and knowledge coverage benchmarking against resources such as Irish Wikipedia and UCCIX. Together, these steps will provide a deeper understanding of Irish usage in parliament and its value for both linguistic research and language model development.

 

Appendix 2:   Simple Example Applications

A2.1  Custom Chatbot for a Website

The current version of the Toolkit can be found at https://www.cloudcix.com/ml.html

 

A backend application was developed to use these services to build a chatbot for any website, and a Chatbot Workbench was developed to create the simple JavaScript that needs to be embedded in a webpage to bring the application alive. This video shows how simple the workbench is to use:

https://youtu.be/GdQTZc_56UQ?si=C7fiqoWqBzqTGME5

 

To test out an example of the chatbot visit the following website…

https://energywiseireland.ie/

The instance of the chatbot has access to a vector database that contains information related to renewable energy. The company got permission from SEAI to scrape their website for most of the data. You can test the chatbot to confirm that it does not hallucinate and only answers questions that are related to the underlying corpus.

 

 

 

A2.2  Central Statistics Office

A final-year undergraduate student, Barry Halloran, supervised by Tung Tran, completed a final-year project on data retrieval for the Central Statistics Office (CSO). This project builds a chatbot to help users access CSO data, which is published as multidimensional data cubes. The CSO provides free, high-quality statistics on society, economy, and environment, and the chatbot makes this information easier to use through natural language queries rather than manual navigation of the data itself.

When a user asks a question (e.g., “What was the employment rate for males in Dublin in 2002?”), the system converts it into an embedding vector and uses FAISS to find the most relevant data cube. Neo4j is then used to explore the cube’s structure, checking available dimensions such as year, region, gender, or statistic. If key details are missing, the chatbot asks clarifying questions before proceeding.

Finally, the system generates a Cypher query to extract the requested data and returns the result conversationally. By combining natural language processing, similarity search, and graph database queries, the chatbot provides an interactive, accurate, and user-friendly way to explore CSO data cubes.
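The flow described above (find the nearest cube, check its dimensions, ask for anything missing, then build a query) can be sketched in a few lines. This is an illustration under stated assumptions and not the student project's code: a brute-force cosine search stands in for FAISS, a dictionary stands in for the Neo4j cube structure, and the cube names, toy vectors, node labels and relationship type are all hypothetical.

```python
import math
from typing import Any, Dict, List, Tuple

# Hypothetical cube catalogue: a toy embedding vector plus the dimensions each cube needs.
CUBES = {
    "employment_rate": {"vector": [1.0, 0.0], "dimensions": ["year", "region", "gender"]},
    "house_prices": {"vector": [0.0, 1.0], "dimensions": ["year", "region"]},
}


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def nearest_cube(query_vector: List[float]) -> str:
    """Brute-force nearest neighbour; a vector index does the same job at scale."""
    return max(CUBES, key=lambda name: cosine(query_vector, CUBES[name]["vector"]))


def missing_dimensions(cube: str, filters: Dict[str, str]) -> List[str]:
    """Dimensions the cube requires that the user has not yet supplied."""
    return [d for d in CUBES[cube]["dimensions"] if d not in filters]


def cypher_query(cube: str, filters: Dict[str, str]) -> Tuple[str, Dict[str, str]]:
    """Build a parameterised Cypher query; user values are passed as parameters."""
    conditions = " AND ".join(f"o.{d} = ${d}" for d in CUBES[cube]["dimensions"])
    query = (
        "MATCH (c:Cube {name: $cube})-[:HAS_OBSERVATION]->(o:Observation) "
        f"WHERE {conditions} RETURN o.value AS value"
    )
    return query, {"cube": cube, **filters}


def respond(query_vector: List[float], filters: Dict[str, str]) -> Dict[str, Any]:
    """Either ask a clarifying question or return the query to run."""
    cube = nearest_cube(query_vector)
    missing = missing_dimensions(cube, filters)
    if missing:
        return {"ask": f"Please specify: {', '.join(missing)}"}
    query, params = cypher_query(cube, filters)
    return {"cypher": query, "params": params}


partial = respond([0.9, 0.1], {"year": "2002", "region": "Dublin"})
complete = respond([0.9, 0.1], {"year": "2002", "region": "Dublin", "gender": "male"})
```

In this sketch `partial` asks the user for the missing gender, while `complete` carries a query whose user-supplied values travel as parameters rather than being spliced into the query text.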

Demo: demo_vid | media.heanet.ie

 



[1] https://www.cloudcix.com/ml.html