[Druid] ingestion spec for csv file (ingesting from a CSV file)

인턴기록지

인턴신분현경이 2020. 11. 26. 14:15

(druid 0.19.0)

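The Druid documentation only gives examples for the JSON file format,

so ingesting a CSV file gave me a lot of trouble.

There are separate explanations of the parser spec, the firehose spec, and inputFormat, but no complete example that puts them together. Since the slightest mistake ended in "ingestion failed", it was quite a struggle.

The fix turned out to be very simple. Write a single inputFormat (no need to worry about the parser and the like) and the data goes straight in.

The Druid documentation shows an example that lists the columns explicitly under columns. I entered columns the same way, but it did not work.

In the end, instead of listing columns, I used findColumnsFromHeader.

"inputFormat" : {
  "type" : "csv",
  "findColumnsFromHeader" : true
}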

Written like this, Druid reads the header on the first line of the txt file and uses it as the column names, so there is no need to list the columns yourself.
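To make that concrete, here is a rough Python sketch of the idea behind findColumnsFromHeader: take the first line as the column names and map every following row onto them. It only illustrates the concept with Python's csv module and is not how Druid implements it; read_with_header is a made-up helper name, and the sample rows are taken from the test file used in this post.

```python
import csv
import io


def read_with_header(text):
    """Use the first CSV line as column names, similar in spirit to
    "findColumnsFromHeader": true (illustration only, not Druid's code)."""
    reader = csv.reader(io.StringIO(text))
    columns = next(reader)  # the header row becomes the column list
    rows = [dict(zip(columns, values)) for values in reader]
    return columns, rows


sample = (
    "bcycl_no,year,type\n"
    "SJ-0460,,어울링\n"
    "SJBIKE_00022,2018-06-26 11:30,뉴어울링\n"
)
columns, rows = read_with_header(sample)
print(columns)          # column names discovered from the header
print(rows[1]["year"])  # value of the "year" column in the second data row
```

With an explicit columns list, the names in the spec have to line up with the file by hand; reading them from the header removes that source of mismatch.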

The full CSV ingestion spec

{
  "type" : "index_parallel",
  "spec" : {
    "dataSchema" : {
      "dataSource" : "ewl_bicycle",
      "timestampSpec" : {
        "column" : "year",
        "format" : "auto",
        "missingValue" : "2014-06-04T00:00"
      },
      "dimensionsSpec" : {
        "dimensions" : [
          "bcycl_no",
          "type"
        ]
      },
      "metricsSpec" : [],
      "granularitySpec" : {
        "type" : "uniform",
        "segmentGranularity" : "month",
        "queryGranularity" : "minute",
        "rollup" : false
      }
    },
    "ioConfig" : {
      "type" : "index_parallel",
      "inputSource" : {
        "type" : "local",
        "baseDir" : "quickstart/tutorial/",
        "filter" : "ewl_bicycle_test.txt"
      },
      "inputFormat" : {
        "type" : "csv",
        "findColumnsFromHeader" : true
      },
      "appendToExisting" : false
    },
    "tuningConfig" : {
      "type" : "index_parallel",
      "maxRowsPerSegment" : 5000000,
      "maxRowsInMemory" : 25000
    }
  }
}
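A spec like the one above is submitted by POSTing it to the Overlord's task endpoint, /druid/indexer/v1/task. The following is a minimal Python sketch of that step using only the standard library. The base URL is an assumption (port 8081 is what the single-server quickstart uses for the Coordinator/Overlord), and build_task_request, submit_task, and the spec file name are hypothetical names for illustration.

```python
import json
import urllib.request

# Assumption: the Overlord API is reachable at this address.
OVERLORD_URL = "http://localhost:8081"


def build_task_request(spec, overlord_url=OVERLORD_URL):
    """Build (but do not send) a POST request for the Overlord task endpoint."""
    body = json.dumps(spec).encode("utf-8")
    return urllib.request.Request(
        url=overlord_url + "/druid/indexer/v1/task",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def submit_task(spec, overlord_url=OVERLORD_URL):
    """Send the spec and return the parsed JSON response."""
    request = build_task_request(spec, overlord_url)
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


# Usage (hypothetical file name holding the spec above):
# with open("ewl_bicycle_spec.json", encoding="utf-8") as f:
#     print(submit_task(json.load(f)))
```

Keeping request construction separate from sending makes it possible to inspect the URL and body before anything reaches the cluster.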

ewl_bicycle_test.txt

bcycl_no,year,type
SJ-0460,,어울링
SJ-0461,,어울링
SJ-0462,,어울링
SJ-0463,,어울링
SJBIKE_00022,2018-06-26 11:30,뉴어울링
SJBIKE_00025,2018-06-26 11:30,뉴어울링
SJBIKE_01622,2019-11-28 15:21,뉴어울링
SJBIKE_01623,2019-11-28 15:21,뉴어울링
SJBIKE_01624,2019-11-28 15:21,뉴어울링
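The first four rows have an empty year, which is why the timestampSpec sets missingValue. As a rough illustration of the effect (not of Druid's actual parser), the Python sketch below falls back to that value whenever the timestamp field is empty. resolve_timestamp is a made-up helper, and the assumption is that an empty CSV field is treated as a missing timestamp.

```python
from datetime import datetime

# Same value as timestampSpec.missingValue in the spec above.
MISSING_VALUE = "2014-06-04T00:00"


def resolve_timestamp(raw, missing_value=MISSING_VALUE):
    """Return the row's timestamp, falling back to missing_value
    when the field is empty (assumed to count as missing)."""
    text = raw.strip() or missing_value
    return datetime.fromisoformat(text)


lines = [
    "SJ-0460,,어울링",
    "SJBIKE_00022,2018-06-26 11:30,뉴어울링",
]
for line in lines:
    bcycl_no, year, _bike_type = line.split(",")
    print(bcycl_no, resolve_timestamp(year).isoformat())
```

Under that assumption, every row without a year lands in the 2014-06 segment, since segmentGranularity is month.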

Ta-da, success!

Next I wanted to ingest data from PostgreSQL into Druid. The documentation said it was possible, so I gave it a try.
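For orientation, an ioConfig for this kind of ingestion might look roughly like the Python dict below, which serializes to the JSON that would go into the spec. The field names follow the SQL input source section of the native batch ingestion docs as far as I can tell; the connection URI, credentials, and query are placeholders invented for illustration, not the values actually used here.

```python
import json

# Placeholder connection details and query: replace with real ones.
sql_io_config = {
    "type": "index_parallel",
    "inputSource": {
        "type": "sql",
        "database": {
            "type": "postgresql",
            "connectorConfig": {
                "connectURI": "jdbc:postgresql://localhost:5432/mydb",
                "user": "druid",
                "password": "changeme",
            },
        },
        "sqls": ["SELECT bcycl_no, year, type FROM ewl_bicycle"],
    },
}

print(json.dumps(sql_io_config, indent=2))
```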

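Total failure.

I keep getting the error below. Something around SQLFirehoseDatabaseConnector seems to be the problem.

Meanwhile, on team lead Gil's server, ingestion reportedly worked fine when Druid was run from src instead of the apache-druid-0.19.0 bin distribution...

I also put the src build into Docker on the dev server and ran it, but it failed. Whatever I tried, it failed.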

Exception: HTTP Error 400: Bad Request, check overlord log for more details.
{"error":"Could not resolve type id 'postgresql' as a subtype of `org.apache.druid.metadata.SQLFirehoseDatabaseConnector`: known type ids = [] (for POJO property 'database')\n 
at [Source: (org.eclipse.jetty.server.HttpInputOverHTTP); line: 31, column: 19] (through reference chain: org.apache.druid.indexing.common.task.batch.parallel.ParallelIndexSupervisorTask
[\"spec\"]->org.apache.druid.indexing.common.task.batch.parallel.ParallelIndexIngestionSpec[\"ioConfig\"]->org.apache.druid.indexing.common.task.batch.parallel.ParallelIndexIOConfig
[\"inputSource\"]->org.apache.druid.metadata.input.SqlInputSource[\"database\"])"}
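The phrase "known type ids = []" says that no subtype of SQLFirehoseDatabaseConnector was registered when the spec was parsed. One thing that may be worth ruling out (an assumption, not something verified in this post) is whether postgresql-metadata-storage is listed in druid.extensions.loadList for the process that received the task. Below is a small Python sketch for reading that property from the text of a runtime properties file; read_load_list is a hypothetical helper and the sample value is for illustration only.

```python
import json
import re


def read_load_list(properties_text):
    """Return the extension names from druid.extensions.loadList, or [] if absent."""
    for line in properties_text.splitlines():
        line = line.strip()
        if line.startswith("#"):
            continue  # skip commented-out settings
        match = re.match(r"druid\.extensions\.loadList\s*=\s*(.+)", line)
        if match:
            # The property value is a JSON array of extension names.
            return json.loads(match.group(1))
    return []


# Sample value for illustration only, not necessarily what any install contains.
example = (
    'druid.extensions.loadList=["druid-hdfs-storage", '
    '"druid-kafka-indexing-service", "druid-datasketches"]\n'
)
print("postgresql-metadata-storage" in read_load_list(example))
```

If the extension is listed and the error persists, the cause lies elsewhere and this check only eliminates one candidate.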

So I dug through the extension folder.

(I unpacked the .jar files before going in.)

cd extensions/postgresql-metadata-storage/org/apache/druid
[root@localhost druid]# ls
firehose  metadata
[root@localhost druid]# cd firehose/
[root@localhost firehose]# ls
PostgresqlFirehoseDatabaseConnector.class

[root@localhost druid]# cd metadata/
[root@localhost metadata]# ls
storage
[root@localhost metadata]# cd storage/
[root@localhost storage]# ls
postgresql
[root@localhost storage]# cd postgresql/
[root@localhost postgresql]# ls
PostgreSQLConnector$1.class  PostgreSQLConnectorConfig.class        PostgreSQLTablesConfig.class
PostgreSQLConnector.class    PostgreSQLMetadataStorageModule.class
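A jar is a zip archive, so the same check can be done without unpacking anything. Here is a minimal Python sketch with the standard zipfile module; find_entries is a made-up helper and the jar file name in the usage comment is hypothetical.

```python
import zipfile


def find_entries(jar_path, needle):
    """List entries in a jar (a zip archive) whose path contains needle."""
    with zipfile.ZipFile(jar_path) as jar:
        return [name for name in jar.namelist() if needle in name]


# Usage (hypothetical jar file name):
# find_entries(
#     "extensions/postgresql-metadata-storage/postgresql-metadata-storage-0.19.0.jar",
#     "FirehoseDatabaseConnector",
# )
```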

After some googling, this person's case is the closest to mine, but they are running Druid built from src... (I am on the bin distribution)

github.com/apache/druid/issues/7874

PostgresqlFirehoseDatabaseConnector not bundled · Issue #7874 · apache/druid

Affected Version 0.14.2-incubating Description In spite of including postgresql-metadata-storage in druid.extensions.loadList, when I try to run a SQL index task with a PostgreSQL source, I get the...

github.com/apache/druid/blob/0fa90008496926c15426710b0dd4698bdc224bac/extensions-core/postgresql-metadata-storage/src/main/java/org/apache/druid/firehose/PostgresqlFirehoseDatabaseConnector.java

I found this file in my own installation and compared it: identical.

To check the code I used jd-gui, which decompiles .class files.

So what is the problem...?

Native batch ingestion · Apache Druid
