Thanks to its hypergrowth, the future French unicorn in the tourism sector saw its data volumes grow to 1.5 TB in 2019. Analysts took more than a week to deliver and present BI reports to decision boards, with no certainty that the numbers shown were accurate. We implemented a data lake that delivers aggregated data collections matching identified business needs, so that analysts can now deliver reports in 4 hours.
The startup's selling point is custom-made, unique trips for travellers. To deliver them, it relies on large data volumes from various sources: data collected from exchanges between agents and travellers, payment data, user data, and website session data.
Sicara is a driving force on technical issues: they helped us choose the technology stack that fitted our business ambitions and shared their ETL best practices. Furthermore, Sicara's teams integrated perfectly with our in-house data and business teams and helped us implement lean methodology.
We developed a data lake combining historical data with up-to-date data delivered in real time. This data is either stored raw or aggregated from 7 different sources in response to pre-identified business needs. First, product and marketing teams use raw data (website user session data, client recommendations, sales data) to enhance the user experience. Second, BI teams rely on data aggregated according to specific business rules, which allows them to deliver BI reports to the executive team and internal teams in less than 4 hours (compared to 1 week beforehand).
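To make the idea of "data aggregated according to specific business rules" concrete, here is a minimal sketch of such an aggregation step. The field names, the per-destination grouping, and the "confirmed sales only" rule are illustrative assumptions, not the startup's actual schema or rules:

```python
from collections import defaultdict

def aggregate_sales_by_destination(raw_sales):
    """Aggregate raw sale events into a per-destination summary,
    applying a simple business rule: only confirmed sales count."""
    summary = defaultdict(lambda: {"bookings": 0, "revenue": 0.0})
    for sale in raw_sales:
        # Business rule (assumed): ignore pending or cancelled sales.
        if sale.get("status") != "confirmed":
            continue
        dest = summary[sale["destination"]]
        dest["bookings"] += 1
        dest["revenue"] += sale["amount"]
    return dict(summary)

raw = [
    {"destination": "Japan", "amount": 3200.0, "status": "confirmed"},
    {"destination": "Japan", "amount": 2800.0, "status": "confirmed"},
    {"destination": "Peru", "amount": 1900.0, "status": "cancelled"},
]
report = aggregate_sales_by_destination(raw)
print(report)  # {'Japan': {'bookings': 2, 'revenue': 6000.0}}
```

In the real pipeline, a step like this would run over each source feed and write its output into one of the pre-aggregated collections that BI teams query directly, which is what shortens report turnaround.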
We implemented an EL-ETL pipeline that organized 1 TB of historical data: more than 100 billion documents were aggregated into 22 collections. Once aggregated, these collections are provisioned in real time by a dual system of 20 RabbitMQ workers that manage this data and 21 RabbitMQ workers that update it. Furthermore, the data engineering team reworked the PostgreSQL architecture to make it fluid and scalable, so it can adapt to the startup's ever-changing needs.
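A worker in such a real-time provisioning system typically consumes update events from a queue and merges them into the target collection. The sketch below shows one possible shape for an update worker, using the pika RabbitMQ client; the queue name, message format, and in-memory stand-in for the datastore are all assumptions for illustration:

```python
import functools
import json

def apply_update(store, event):
    """Merge an incoming update event into the target collection.
    `store` is an in-memory dict standing in for the real datastore."""
    doc = store.setdefault(event["doc_id"], {})
    doc.update(event["fields"])
    return doc

def on_message(channel, method, properties, body, store):
    """RabbitMQ callback: decode the event, apply it, then acknowledge
    so the broker can safely drop the message."""
    event = json.loads(body)
    apply_update(store, event)
    channel.basic_ack(delivery_tag=method.delivery_tag)

def run_worker(amqp_host="localhost", queue="collection_updates"):
    """Start a blocking consumer. Requires a running RabbitMQ broker;
    host and queue name here are placeholder assumptions."""
    import pika

    store = {}
    connection = pika.BlockingConnection(pika.ConnectionParameters(amqp_host))
    channel = connection.channel()
    channel.queue_declare(queue=queue, durable=True)
    channel.basic_consume(
        queue=queue,
        on_message_callback=functools.partial(on_message, store=store),
    )
    channel.start_consuming()
```

Keeping the merge logic (`apply_update`) separate from the messaging plumbing makes each worker's business logic testable without a broker, which matters when running dozens of workers side by side as described above.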
Related articles written by Sicara data scientists
Automate AWS Tasks Thanks to Airflow Hooks
This article is a step-by-step tutorial that shows you how to upload a file to an S3 bucket with an Airflow ETL (Extract, Transform, Load) pipeline.
How Apache Airflow Distributes Jobs on Celery workers
The life of a distributed task instance
How to Get Certified in Spark by Databricks?
This article aims to prepare you for the Databricks Spark Developer Certification: register, train and succeed, based on my recent experience.