Data loss on elasticsearch with Kubernetes - indexes are deleted and created automatically

Setup

I'm using elasticsearch:7.9.3 on Kubernetes via Google Kubernetes Engine. The elasticsearch data is persisted using a PersistentVolumeClaim with 20GB. I tested that the PersistentVolumeClaim is set up correctly by deleting and recreating the elasticsearch deployment and checking that data remains available - it did.

Elasticsearch is set up minimally, on its own, without extras like automatic scaling or Kibana (which is set up locally and independently on my system). The deployment.yaml looks like this:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: elasticsearch-db
spec:
  selector:
    matchLabels:
      app: elasticsearch-db
      tier: elastic
  template:
    metadata:
      labels:
        app: elasticsearch-db
        tier: elastic
    spec:
      terminationGracePeriodSeconds: 300
      initContainers:
        # NOTE:
        # This fixes the permissions on the data volume, since the
        # elasticsearch container does not run as the root user.
        # https://www.elastic.co/guide/en/elasticsearch/reference/current/docker.html#_notes_for_production_use_and_defaults
        - name: fix-the-volume-permission
          image: busybox
          command:
          - sh
          - -c
          - chown -R 1000:1000 /usr/share/elasticsearch/data
          securityContext:
            privileged: true
          volumeMounts:
          - name: elasticsearch-db-storage
            mountPath: /usr/share/elasticsearch/data
        # NOTE:
        # To increase the default vm.max_map_count to 262144
        # https://www.elastic.co/guide/en/elasticsearch/reference/current/docker.html#docker-cli-run-prod-mode
        - name: increase-the-vm-max-map-count
          image: busybox
          command:
          - sysctl
          - -w
          - vm.max_map_count=262144
          securityContext:
            privileged: true
        # To increase the ulimit
        # https://www.elastic.co/guide/en/elasticsearch/reference/current/docker.html#_notes_for_production_use_and_defaults
        - name: increase-the-ulimit
          image: busybox
          command:
          - sh
          - -c
          - ulimit -n 65536
          securityContext:
            privileged: true
      containers:
      - image: elasticsearch:7.9.3
        name: elasticsearch-db
        ports:
          - name: elk-rest-port
            containerPort: 9200
          - name: elk-nodes-port
            containerPort: 9300
        env:
          - name: discovery.type
            value: single-node
          - name: ES_JAVA_OPTS
            value: -Xms2g -Xmx2g
        volumeMounts:
          - mountPath: /usr/share/elasticsearch/data
            name: elasticsearch-db-storage
      volumes:
        - name: elasticsearch-db-storage
          persistentVolumeClaim:
            claimName: elasticsearch-db-storage-claim

---

apiVersion: v1
kind: Service
metadata:
  name: elasticsearch-db
spec:
  selector:
    app: elasticsearch-db
    tier: elastic
  ports:
    - name: elk-rest-port
      port: 9200
      targetPort: 9200
    - name: elk-nodes-port
      port: 9300
      targetPort: 9300
  type: LoadBalancer


---

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: elasticsearch-db-storage-claim
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 20Gi

If I'm not mistaken, this should default to 1 replica. I noticed that there are 7 evicted pods, and the one running pod has two restarts.

Issue

My indexes are created manually, and two processes push about 5 to 8 GB of data into this system via Python's elasticsearch library. However, somehow this data has gone missing. Some recently pushed data is still available, but that was likely pushed to the server after the issue occurred.
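
For reference, the pushing side looks roughly like this (a simplified sketch; the real host, index name and documents differ):

from elasticsearch import Elasticsearch, helpers

# Placeholder host - in reality this points at the LoadBalancer IP of the service.
es = Elasticsearch(["http://<load-balancer-ip>:9200"])

def generate_actions(records):
    # Turn each record into a bulk index action targeting "myindex".
    for record in records:
        yield {"_index": "myindex", "_source": record}

records = [{"field": "value"}] * 1000  # placeholder documents
helpers.bulk(es, generate_actions(records))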

Most likely cause

I found this in the logs.

2021-01-07 17:45:41.937 CET "SSL/TLS request received but SSL/TLS is not enabled on this node, got (16,3,1,0), [Netty4TcpChannel{localAddress=/10.32.6.14:9300, remoteAddress=/10.32.6.1:58768}], closing connection"
2021-01-08 00:38:00.604 CET "[.async-search/fQ0TsMW-TaKyYA3qpisrMg] deleting index"
2021-01-08 00:38:00.904 CET "[myindex/dreRruz0TxaQJsSi8U_BOw] deleting index"
2021-01-08 00:38:01.254 CET "[read_me/kMNmyfHoT4KAZyajnWLg2A] deleting index"
2021-01-08 00:38:01.524 CET "[read_me] creating index, cause [api], templates [], shards [1]/[1]"
2021-01-08 00:38:01.904 CET "[read_me/LaGFr3GcR4Gy-by2LUaiaw] create_mapping [_doc]"
2021-01-08 00:38:19.811 CET "[myindex] creating index, cause [auto(bulk api)], templates [], shards [1]/[1]"
2021-01-08 00:38:20.071 CET "[myindex/iQHwXEgKQb6F2Ur8JfFNIw] create_mapping [_doc]"
2021-01-08 02:30:00.002 CET "starting SLM retention snapshot cleanup task"
2021-01-08 02:30:00.003 CET "there are no repositories to fetch, SLM retention snapshot cleanup task complete"
2021-01-08 02:38:00.004 CET "triggering scheduled [ML] maintenance tasks"
2021-01-08 02:38:00.004 CET "Deleting expired data"
2021-01-08 02:38:00.005 CET "Completed deletion of expired ML data" 
2021-01-08 02:38:00.006 CET "Successfully completed [ML] maintenance tasks"
2021-01-08 04:14:35.547 CET "[.async-search] creating index, cause [api], templates [], shards [1]/[1]"
2021-01-08 04:14:35.553 CET "updating number_of_replicas to [0] for indices [.async-search]"

My understanding is that deleting an index removes the data as well (even if there's a way to restore it, that's not my concern here).

While researching this issue I came across the meow attack on elasticsearch databases. This attack essentially deletes the indices and creates indices with random-string names ending in -meow instead (because cats like to drop stuff, in this case database tables). For comparison, these are my current indices:

green  open .kibana-event-log-7.9.3-000001 ulnmulwTSzmZ2vi6FY6NGg 1 0     4     0  21.6kb  21.6kb
yellow open read_me                        LaGFr3GcR4Gy-by2LUaiaw 1 1     1     0   4.9kb   4.9kb
green  open .kibana_task_manager_1         Q_ud7vO2RN6ImILgNwS4iQ 1 0     6 14071   1.4mb   1.4mb
green  open .async-search                  maKtb69bS-WCQQSTTOTJ4Q 1 0     0     0   3.3kb   3.3kb
green  open .kibana_1                      XqDgNGuJTzyDeVLyVaM_eQ 1 0    45     1 546.4kb 546.4kb
yellow open myindex                        iQHwXEgKQb6F2Ur8JfFNIw 1 1 30229 25118    27mb    27mb

They look somewhat similar, but there is no random string and no -meow ending there. Could this be a variation of that attack? Admittedly, it was more important to get this running before the Christmas holidays than to properly secure it, although that's what I'm working on right now.

If this is indeed the issue, then just turning on authentication should fix this, right?
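
If so, I assume it's mainly a matter of enabling security in the container's env section, something along these lines (a sketch based on my reading of the docs for the official image; ELASTIC_PASSWORD is just a placeholder here, the real password would come from a Kubernetes Secret):

        env:
          - name: discovery.type
            value: single-node
          - name: ES_JAVA_OPTS
            value: -Xms2g -Xmx2g
          - name: xpack.security.enabled
            value: "true"
          # Placeholder only - in practice this should be read from a Secret.
          - name: ELASTIC_PASSWORD
            value: changeme

The pushing side would then have to authenticate with those credentials as well.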

Potential other causes

At some point, while I was pushing data to the server over a week or two, I received this error on the pushing side:

elasticsearch.exceptions.TransportError: TransportError(429, 'cluster_block_exception', 'index [myindex] blocked by: [TOO_MANY_REQUESTS/12/disk usage exceeded flood-stage watermark, index has read-only-allow-delete block];')

That error seems to have several possible causes, one of them being that all of the allocated storage has been used up, but my understanding is that this should make the data read-only, not remove most or all of it. So that might or might not be the cause of my issues. Going through the logs, I see several of these warnings on the server side as well.
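
As far as I understand, once disk space is freed the block can be lifted again with something like this (a sketch using the same Python client; I believe recent Elasticsearch versions also remove the block automatically once disk usage drops below the watermark):

from elasticsearch import Elasticsearch

# Placeholder host - in reality this points at the LoadBalancer IP of the service.
es = Elasticsearch(["http://<load-balancer-ip>:9200"])

# Clear the read-only-allow-delete block on the index after freeing disk space.
es.indices.put_settings(
    index="myindex",
    body={"index.blocks.read_only_allow_delete": None},
)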

There are also several java.lang.OutOfMemoryError entries and generally a lot of stack traces. I believe these triggered a restart of elasticsearch, but probably didn't cause the issue.



1 Answer

Waiting for an expert to reply.
