Name	Link	Notes
Artificial Intelligence for Ireland (AIFI)	https://aiforireland.ie//	AI for Ireland's website
Artificial Intelligence Association of Ireland (AIAI)	https://aiai.ucd.ie/	Representative Association
CeADAR	https://ceadar.ie/	CeADAR is Ireland’s national centre for AI funded by Enterprise Ireland and IDA.
Connecting Government 2030	https://assets.gov.ie/static/documents/connecting-government-2030-a-digital-and-ict-strategy-for-irelands-public-service-e7a3.pdf	A Digital and ICT Strategy for Ireland’s Public Service
EU Artificial Intelligence (AI) Act	https://enterprise.gov.ie/en/what-we-do/innovation-research-development/artificial-intelligence/eu-ai-act/
National AI Strategy	https://assets.gov.ie/static/documents/national-ai-strategy-refresh-2024.pdf	2024 Refresh
Official Languages Act 2003, Amended 2021	https://www.gov.ie/en/department-of-rural-and-community-development-and-the-gaeltacht/publications/official-languages-act/#copy-of-legislation
Small Language Models are the Future of Agentic AI	https://arxiv.org/pdf/2506.02153	2nd June 2025 NVIDIA Research
Research Ireland Centre for Research Training in Artificial Intelligence	https://www.crt-ai.ie
Insight Research Ireland Centre for Data Analytics, UCC	https://www.insight-centre.org
ENTIRE EDIH Ireland	https://entire-edih.ie

Executive Summary

University College Cork (UCC) and CloudCIX Limited want to revolutionise the efficiency of citizen/customer service for the Irish Government and businesses using ethical, sovereign, agentic, bilingual AI. This document is to inform potential stakeholders and gather support for the project.

UCC and CloudCIX have been working jointly since January 2024 to develop a Machine Learning (ML) toolkit designed to assist in the development of customer service applications for government and private companies. Post graduate students from other colleges have also contributed data sets. UCC and CloudCIX wish to setup a community to manage the further development and commercialisation of the outputs of this research.

· Open to all students, academics, and professionals working in Ireland and interested in the broad field of Artificial Intelligence.

· Government agencies such as Enterprise Ireland (EI) CeADAR, and Údarás Na Gaeltachta will be invited to oversee its operations.

· An organisation will be selected, or a new organisation will be formed, to manage the new larger community group. One possibility is that a subcommittee of Artificial Intelligence Association of Ireland (AIAI) could take on this role of managing this community. Participants in the project can join this group to decide strategy and toolkit roadmap.

· The community will investigate EU-based Large Language Models (LLMs) to select/develop one or more Small Language Models (SLMs – lesser than 30B for ease of deployment) suitable for training for different applications. Data sets can be contributed by the community in return for input in the development of the base LLM/SLM model(s).

· Define national toolkit landscape for government, companies and citizens, including localised question-answering, translation tools, document retrieval and summarisation.

By concentrating on customer service applications in service delivery and using SLMs finetuned to Irish culture and data retrieval systems, concerns about hallucinations and unethical/discriminatory responses can be eliminated.

1 Introduction and Background

1.1 Introduction

“AI holds great promise for the delivery of better Public Services, and there is a strong ambition across the public service to harness trustworthy AI for this purpose, as part of the digital transformation of the Public Service.” (National AI Strategy, 2024 Refresh)

“We (Irish Civil Service) will deliver our public services taking a digital by default approach through collaboration with our stakeholders and the public, building towards the target for 2030 set by the Civil Service Renewal strategy of ensuring that 90% of applicable services are consumed online. In delivering a human-driven experience, we will take a universal design approach, so that the services are available on a 24-7 basis, on the device of choice and are delivered in an equitable, inclusive and sustainable manner.” (Connecting Government 2030)

In alignment with the National Strategies quoted above, this document is a call to form a ‘coalition of the willing’ to establish a sovereign, culturally relevant AI ecosystem that empowers citizens, government agencies, and enterprises. The project’s cornerstone is the development of a set of models and tools called the ‘AI Toolkit’, that can later be used to build secure, language-inclusive, and citizen-focused agentic services. The project is called AI For Ireland (AIFI).

University College Cork and CloudCIX Limited began joint work in January 2024 to develop a bilingual EN/GA LLM and other AI/ML tools and models, for use in Ireland by both government and the private sector, to build safe agentic applications and to ensure data sovereignty.

· UCC and CloudCIX team

· GPU and compute resources have been donated to the project by CloudCIX Limited.

· A list of research achievements to date is included in Appendix 1

· The current state of the AI toolkit can be seen by visiting

o https://www.cloudcix.com/ml.html

UCC and CloudCIX now wish to extend the resources deployed to support this initiative by working with other interested parties on this common goal, hence the publication of this document.

1.2 Language support

English and Irish (Gaeilge) are the official languages of the Republic of Ireland and Northern Ireland.

The Constitution of the Republic of Ireland permits the public to conduct its business – and every part of its business – with the State solely through Irish. As a result, public bodies have a duty to comply with this right (Official Languages Act 2003, Amended 2021).

For these reasons, tools developed under this project must be designed for the provision of bilingual customer service solutions.

2 Vision & Objectives

2.1 Vision

“Artificial Intelligence for Ireland - Government, Academia, Industry and Citizen”

Figure 1 represents an entire software stack that can be used to build customer services applications. This document is a request for cooperation in developing layer 3, the National Toolkit, and the associated training data sets.

Figure 1 AI Customer Services Applications Software Stack

Examples of ML models contained in the AI Toolkit.

· Bilingual LLM(s) finetuned with culturally relevant datasets

· Embedding Models for vector database applications

· EN/GA and GA/EN Translation Models

Examples of non ML tools contained in the Toolkit.

· MyGovID Interaction API

· API/MCP integration tools

2.2 Scope

2.2.1 In Scope

The scope of the project being proposed is to build either an Irish Government regulated and/or open source toolkit that can be used to develop agentic customer service applications.

The key component of the Toolkit will be one or more SLMs that will be fine tuned for EN/GA Agentic Applications. These SLMs must be optimised for use with data returned from retrieval technologies such as API, MCP, JSON-stat and vector databases.

The compliment the SLMs, the toolkit will include models to do tasks such as create and retrieve embeddings, perform EN/GA and GA/EN translation, and non ML tools that will interact with document retrieval systems and Identity Provision Services (IPS) such as MyGovID. The current version of the toolkit is hosted at this URL[1].

2.2.2 Out of Scope

Other than to be used as proof of concepts, this project will not develop ‘end use customer service applications’. It is expected that Government agencies will develop specifications for such applications and go to tender to have them developed. Meeting the requirements of the tendering process will hopefully be easier (and hence less expensive) because of the freely available Toolkit.

· Section 2.3 explains what types of end use applications are under consideration.

· Section 2.4 describes the challenges of such applications and how these challenges can be addressed.

· Appendix 2 contains several prototype applications for illustrative purposes.

2.3 End Use Applications

An initial analysis of potential applications suggests that there are four possible application types and use cases that the National AI toolkit might be used to address. These application types will only give rise to minimal risks and consequently, can be marketed and used subject to the existing legislation without additional legal obligations under the AI Act. Responsibility for GDPR compliance will rest with the developers of the Application rather than with the developers of the National AI Toolkit.

Application Type	Example Use Cases
Unstructured public data distribution. No login required.	Accessing public planning application documents.
Structured public data distribution. No login required.	Accessing CSO public data.
Structured Private data distribution. Login required.	Using MyGovID or other IPS to access personal medical records.
Transactional private data processing. Login required.	Using MyGovID to order a driver’s licence. Using MyGovID to order a passport.

Figure 2 Proposed Application Use Cases

It is important to note that all these proposed use cases are data retrieval systems. It is better not to think of them as Retrieval Augmented Generation (RAG) systems because they are in effect Generation Augmented Retrieval systems. The SLM(s) will be used interpret the request from the end user. The System/User prompts will be engineered to prevent the application from answering any question where the answer is not contained in the underlying external data. This architecture eliminates concerns about hallucinations and unethical/discriminatory responses.

Figure 3 contains a schematic from the Nvidia Research 2024 paper “Small Language Models are the Future of Agentic AI” that deals with the architecture being proposed.

Figure 3 Application Architecture (NVIDIA 2025)

2.4 Objectives & Benefits Summary

Strand	Objectives	Benefits
Cooperation	Encourage multi-stakeholder collaboration between government, academia, and private industry.	Published open research, tools, problem statements, and agendas for advancing bilingual (EN/GA) language technologies. Advancement of Ireland as a centre for AI/ML research and development.
Publication & Authorship	Publish papers and open-source data sets and tools.	Academic institutions can align their research with this common objective.
Language processing skills	Build human knowledge and expertise in Irish for data annotations, validations, and evaluations, especially in language processing applications.	Allow government agencies meet their statutory requirement to offer services bilingually. Increased availability/use of GA will help preserve and grow the language.
Data sets	EN, GA and GA/EN parallel data contributions: pre-collected/processed Irish corpora, curated data (e.g., transcripts, debates, TG4, etc), annotations in various formats (e.g., questions and answers).	A growing corpus of curated data, rich in Ireland’s culture and free of undesirable data to be used to finetune SLMs and LLMs.
Tools	Fine tune one or more LLMs or SLMs to act as a base for agentic customer service applications.	Cost effective citizen service.
Training Infrastructure	To provide HPC infrastructure for research.	CloudCIX Limited has committed to €200,000 of compute credits per annum on the Boole Supercomputer toward development of the Model Zoo.
Inference Infrastructure	Establish sovereign deployment infrastructure creating a foundation for trustworthy, ethical, and transparent AI deployment at scale in Ireland.	Ensure agentic AI applications can be deployed in Ireland with confirmed data sovereignty allowing control over sensitive citizen information.

Figure 4 Objectives & Benefits

3. Governance & Ethical Issues

3.1 Legislative Requirements

The project must be aware of and aligned with a number of Irish/EU legislative requirements.

3.1.1 EU Artificial Intelligence (AI) Act

The EU Artificial Intelligence (AI) Act is an EU regulation which entered into force on 2 August 2024 and is directly applicable across the EU.

The EU AI Act establishes a risk-based regulatory framework for artificial intelligence within the European Union. It classifies AI systems into four categories—unacceptable risk, high-risk, limited risk, and minimal risk—with stricter obligations for higher-risk applications

The Toolkit and underlying applications must align with this legislation.

3.1.2 General Data Protection Directive (GDPR 2016)

This will be an issue for the applications developed using the Toolkit.

3.1.3 Data Governance Act (DGA 2022)

This legislation covers data under a public authority's control, accessed remotely.

3.1.4 European Health Data Space Regulation (EHDS 2025)

For medical record information this may be relevant for applications.

4. Work Schedule

4.1 Challenges

The work of the proposed group will be to address many challenges. This section lists these challenges with some comments. Addressing these challenges will be among the key objectives of the group.

4.1.1 Technical

The Toolkit must deliver on several technical fronts to allow subsequently developed applications to be effective.

1. Build bilingual data sets required to meet challenge 2.

2. Locate base SLMs and fine tune them to:

. Work bilingually in EN/GA.

. Be cognisant of Irish cultural norms and understand question and prepare answers in that context.

. Interface with data sources using MCP, JSON-stat, APIs.

. Interface with unstructured data by means of embedding databases.

. Prevent users from escaping constraints applied by system and user prompts.

3. Build tools that perform tasks such as translation and document embedding.

4. Build tools that perform speech to text and text to speech, bilingually.

4.1.2 Enviromental, Economic & Infrastructural

AI can be power hungry and therefore expensive. It is vital to have strategies that mitigate environmental and economic costs. Not training an SLM from scratch, but rather using pre trained base models and finetuning them in an example of such a strategy. An objective of all the tools developed for the Toolkit is that they can run on minimal hardware. For example, the Mistral 24B SLM can be run on a single Nvidia H100 GPU at a power cost of 1kW. The same SLM can handle queries from multiple applications allowing scaling to occur as required with committing resources that are not utilised.

4.1.3 Social & Ethical

While AI provides many opportunities there are social risks and ethical dilemmas that need to be addressed. However, this project will be focused on customer service applications and by virtue of this narrow focus will avoid many of these challenges.

4.1.4 Legal & Policy

While the project needs to be compliant with Irish and EU legislation including the EU AI Act and GDPR, the narrow focus of the application set being considered limits this exposure.

4.1.5 Practical Deployment Challenges

Ireland must host both the Toolkit and subsequent applications within sovereign data centres.

4.2 Dataset Collection

Datasets of various types are required, including factual information, question–answer pairs, instructions and responses, retrieval-augmented generation (RAG) samples, and tool/function-calling examples. All datasets must be of high quality, truthful, and safe.

A strong focus on Irish-language resources is essential, complemented by carefully filtered English data to ensure cultural and value alignment with Irish society. At present, the available Irish-language data is estimated at under 1 billion tokens, whereas fine-tuning billion-parameter scale language models typically requires trillions of tokens. Addressing this data scarcity will be a core challenge.

While textual data will remain the most important modality, multimodal data (e.g., text–speech, text–image, and conversational interaction datasets) can also play a role, especially in citizen service contexts where accessibility and inclusivity are priorities.

All datasets must be:

· Open-source where possible, to ensure transparency and public trust.

· Legally compliant, respecting copyright, GDPR, and data protection requirements.

· Well documented, to enable reproducibility and accountability.

In parallel with training data, a comprehensive evaluation dataset must be developed to benchmark model performance. Evaluation data should be representative of real-world citizen service needs, such as customer service dialogues, frequently asked questions, and government support scenarios.

Existing Irish resources provide a starting point:

· UCCIX corpus: A collection of raw text data from multiple corpora, filtered for Irish language.

· IrishQA: A benchmark for question–answering tasks in Irish.

· IrishMultiJail: A dataset for testing safety alignment and culturally sensitive refusal behaviour in both Irish and English.

· IRLBench: A benchmark suite for evaluating bilingual performance using the Leaving Certificate Examination.

These resources should be systematically inspected, cleaned, extended, and maintained. Data creation will be required through partnerships with public sector agencies, and collaboration with academia and industry.

4.3 Model Selection & Finetuning

It is proposed to work with EU-based LLMs (e.g., Mistral) to develop one or more SLMs that can be used as base models for fine tuning. The goal is to produce bilingual, culturally grounded, and safe language models and tools optimized for citizen service applications in Ireland.

Several factors need to be considered in model selection and fine-tuning:

1. Data Utilization

- Designing effective pre-training and fine-tuning pipelines to maximize useful information from scarce Irish-language data.

- Exploring data augmentation, synthetic data generation, and cross-lingual transfer learning to scale Irish resources without compromising quality.

2. Bilingual & Culturally Sensitive Adaptation

- Enhancing models to seamlessly switch between Irish and English, supporting code-switching where appropriate.

- Embedding cultural context, local norms, and values into the training process to ensure outputs resonate with Irish citizens.

- Leveraging Irish-specific evaluation benchmarks (e.g., IRLBench) to ensure bilingual robustness.

3. Safeguarding

- Incorporating safety alignment mechanisms, such as instruction filtering, red-teaming in both languages, and agent-in-the-loop evaluation with dataset such as IrishMultiJail.

- Ensuring compliance with EU AI Act principles around transparency, robustness, and accountability.

- Applying continual monitoring and auditing pipelines to track performance drift and prevent misuse.

4. Survey of Literature & Best Practices

- Reviewing state-of-the-art bilingual and low-resource NLP techniques (e.g., parameter-efficient fine-tuning, multilingual transfer, and adapters).

- Learning from similar national AI efforts (e.g., Swiss AI Initiative) to avoid duplication and adapt successful strategies.

Figure 5 Proposed steps to train foundation bilingual Irish-English language model

4.5 Roadmap & Timeline

Date	Milestone	Notes
Sept 1st 2025	Create a draft plan.	Initial circulation to Údarás na Gaeltachta and AIAI to get initial support.
Sept 18th 2025	Publicly announce the community formation and arrange the first meeting of interested parties.	Decide whether this can be part of an existing organisation (preferred) or setup a new organisation to manage the community. Announce to coincide with National AI Meet on Thursday 18th Sept in Galway.
October 2025	Review existing landscape. Access feedback from interested parties.	Perhaps a series of virtual presentations by existing practitioners.
	Publicise the ‘path to success vision’.	Address concerns around ethics, confidentiality and hallucination.
November 2025	Decide on priorities and assign groups to each priority.
November 2025	Publish a website outlining the work being undertaken.	A content platform like Sphinx or a Wiki would allow collaboration on content. We want to avoid the need for a centralised editor.
1st / 2nd December 2025	Announce the launch of the community.	Use the AIAI conference as the initiative launch event.
January 2026	Engage with government and semi state organisations to gather needs and prepare POCs.	Use the information gathered to update this document.
February 2026	Form a committee to define strategy and manage deliverables.

Figure 6 Timeline, first six months.

5. Conclusion

Ireland needs ethical, sovereign, agentic, bilingual artificial intelligence to meet the needs of its citizens and businesses. The proposed community will facilitate discussion and standardisation, as well as accelerate the development of AI frameworks and toolkits. Everyone working in Ireland and interested in the broad field of Artificial Intelligence should be included in this national initiative, fostering long-term and sustainable benefits for the nation, including huge job opportunities, new revenue streams and technological breakthroughs.

Acknowledgement

Tung Tran and Harry Nguyen are supported by the Taighde Éireann–Research Ireland under Grant 18/CRT/6223 and 12/RC/2289-P2, which are co-funded under the European Regional Development Fund.

Appendix 1: Publications and achievements to date.

A1.1 Tung Tran Ph.D. candidate, UCC

[RESEARCH PAPER, MODEL & DATA] UCCIX & UCCIX‑2 – ECAI2024

Llama 2-based Irish LLMs built under extreme low-resource constraints (≈ 10 000 × fewer tokens than English).

Used continued pretraining to transfer world knowledge from a pretrained Llama 2 checkpoint.
IrishQA: 400 bilingual English-Irish multi-choice questions on topics surrounding Ireland.
UCCIX corpus:

500M native Irish tokens, collected from various sources, including almost all online sources
Machine translated data, up to 2B tokens.

[RESEARCH PAPER & MODEL] UCCIX-Translate – ACL2025 Workshop on Technologies for Machine Translation of Low-Resource Languages

We achieved significant improvements in translation tasks, of 36.7% for English to Irish and 133.4% for Irish to English compared to the previous state-of-the-art. This is through relying on UCCIX, with its extensive knowledge of both English and Irish, and additional fine-tuning.
Proposed adaptive layer‑wise fine‑tuning, showing that the first and last transformer layers govern language understanding, enabling efficient training for small languages.

[RESEARCH PAPER & MODEL] UCCIX-Reasoning – Under submission at AAAI2026

Separated reasoning and language by generating Chain-of-Thought (CoT) in English while returning final answers in Irish.
English-pivoted CoT training delivered up to 28.33 % accuracy improvement over baselines and improved user experience.
New insight into the interplay between reasoning and multilinguality in LLMs.

[RESEARCH PAPER & DATA] IRLBench – Under submission at KDD2026 – Datasets and Benchmark Track

A 12‑subject, bilingual Irish-English, multi‑modal benchmark derived from the 2024 Leaving Certificate exam.
Tasks are framed as long‑form generation with the official marking scheme, enabling fine‑grained evaluation of correctness and language fidelity.

[RESEARCH PAPER] Multi-Agent Collaboration Mechanisms: A Survey of LLMs – Under submission at ACM Computing Surveys

We provided an extensive survey of the collaborative aspect of MASs and introduces an extensible framework to guide future research.
Our framework characterizes collaboration mechanisms based on key dimensions: actors (agents involved), types (e.g., cooperation, competition, or coopetition), structures (e.g., peer-to-peer, centralized, or distributed), strategies (e.g., role-based or model-based), and coordination protocols.

[RESEARCH PAPER] Disentangling Language Understanding and Reasoning Structures in Cross-lingual Chain-of-Thought Prompting – EMNLP2025

We provided evidence for the existence of language-specific local reasoning structures and guided the development of more interpretable and effective multilingual AI systems.
We employed neuron intervention and perturbation techniques to analyse and deactivate language-specific reasoning neurons during cross-lingual prompting, leading to performance disparities across languages, up to 27.4%.

[APPLICATION] CloudCIX RAG Workbench

Fully Customizable RAG Chatbot System: Tailor the chatbot experience to meet your unique business needs, from the reference data sources to the LLM behinds the scene. Utilize most advanced techniques to ensure your chatbot delivers factual, relevant, and up-to-date responses based on your internal documentation.
State-of-the-Art Models: Seamlessly integrate with the latest open-source and close-source models, giving you flexibility and control over your chatbot's performance and capabilities, with a unified interface for all models.
Cost-Efficient Pay-As-You-Go Model: Only pay for the resources you actually use. Scale your chatbot systems without worrying about unnecessary expenses. Whether it is small-scale operations or enterprise-level needs, we have got you covered.
Built for Scalability: Our workbench scales effortlessly, ensuring that your chatbot can grow alongside your business without compromising on performance or user experience through our dedicated computing resources.

[APPLICATION] CloudCIX Guiden - Agentic Chatbot

Agentic chatbot, capable of interacting with CloudCIX SaaS platform, helping users through various tasks: creating and managing support tickets, handling supplier and customer transactions, managing ledgers, generating financial reports, and more.
Guiden can be integrate to a SaaS platform through MCP servers, enabling automatic handling of tasks and functionalities of the platform for the users.

Community Engagement

Previously hosted https://aine.chat, an interactive and demonstration of UCCIX, showcasing how state-of-the-art AI models can serve niche, resource-constrained languages effectively.
Thousands of user sessions on the free public demo, demonstrating strong demand for Irish‑centric AI services.

A1.2 Joseph McInerney Research Assistant, Trinity College Dublin

· MSc Artificial Intelligence, Queen’s University Belfast.

· Currently enrolled in C1 Gaelchultúr Irish language course.

[RESEARCH PAPER] Qomhrá: A Bilingual Irish/English Large Language Model – Intention to Submit to LREC2026

Trained Qwen3-8B with the UCCIX data and a subset of the National
Corpus of Irish (https://www.corpas.ie/en/cng/).

Determined that Google Gemini was better than ChatGPT and Claude for Irish.

Translated an instruction-tuning dataset using Gemini to create a 30K bilingual Ga/En dataset.

Created a first-of-its-kind Irish human preference dataset validated by a native speaker.

Visual Overview:

A1.3 McInerney & Tran: 100 years of Irish in the Oireachtas

[DATA] The Dáil Data

Collected more than 100 years of Oireachtas data (Dáil and Seanad), available in various formats: debate, summaries and reports, written questions and answers, etc.
The corpus is a valuable resource into the use of Irish, English, and code-switching between the 2 languages, with a total amount of 1.2B tokens.

Analysis

This work presents an analysis of the Oireachtas corpus covering more than a century of Irish parliamentary proceedings, spanning debates, summaries, reports, and questions and answers. The dataset is bilingual, containing both Irish and English, and is estimated to comprise between 700M and 1B tokens (words).

The figures demonstrate initial analyses already carried out. The first bar chart tracks the total number of tokens by year, showing steady growth in parliamentary text over time, with a sharp increase from the early 2000s onward. The second chart measures the proportion of Irish tokens by year, highlighting fluctuations in the presence of Irish. The third set of pie charts presents topic distributions, comparing overall, English-only, and Irish-only usage. While English debates emphasize Health, Education, and Finance, Irish ones mainly focus on the development of the Irish language itself (Community, Rural and Gaeltacht Affairs, Arts, Tourism, and Culture).

Additional analyses are planned as future work, including event detection, code-switching patterns, sentiment analysis across topics and time, duplication checks against other Irish corpora, and knowledge coverage benchmarking against resources such as Irish Wikipedia and UCCIX. Together, these steps will provide a deeper understanding of Irish usage in parliament and its value for both linguistic research and language model development.

Appendix 2: Simple Example Applications

A2.1 Custom Chatbot for a Website

The current version of the Toolkit can be located at… https://www.cloudcix.com/ml.html

A backend application was developed to utilise these services to build a chatbot for any website and a Chatbot Workbench was developed to create the simple javascript that needs to be embedded in a webpage to bring the application alive. This video shows how simple the workbench is to use…

https://youtu.be/GdQTZc_56UQ?si=C7fiqoWqBzqTGME5

To test out an example of the chatbot visit the following website…

https://energywiseireland.ie/

The instance of the chatbot has access to a vector database that contains information related to renewable energy. The company got permission from SEAI to scrape their website for must of the data. You can test the chatbot to confirm that it does not hallucinate and only answers questions that are related to the underlying corpus.

A2.2 Central Statistics Office

A final year undergraduate student, Barry Halloran, under supervision by Tung Tran completed a final year project for data retrieval for the Central Statistics Office (CSO). This project builds a chatbot to help users access CSO data, which is published as multidimensional data cubes. The CSO provides free, high-quality statistics on society, economy, and environment, and the chatbot makes this information easier to use through natural language queries rather than manual navigation of the data itself.

When a user asks a question (e.g., “What was the employment rate for males in Dublin in 2002?”), the system converts it into an embedding vector and uses FAISS to find the most relevant data cube. Neo4j is then used to explore the cube’s structure, checking available dimensions such as year, region, gender, or statistic. If key details are missing, the chatbot asks clarifying questions before proceeding.

Finally, the system generates a Cypher query to extract the requested data and returns the result conversationally. By combining natural language processing, similarity search, and graph database queries, the chatbot provides an interactive, accurate, and user-friendly way to explore CSO data cubes.

Demo: demo_vid | media.heanet.ie